I have a dataframe containing 15K+ strings in the format xxxx-yyyyy-zzz, where yyyyy is a randomly generated 5-digit number. Given that xxxx is 1000 and zzz is 200, how can I generate the random yyyyy and add it to the dataframe so that each string is unique?
number
0 1000-12345-100
1 1000-82045-200
2 1000-93035-200
import pandas as pd
data = {"number": ["1000-12345-100", "1000-82045-200", "1000-93035-200"]}
df = pd.DataFrame(data)
print(df)
I'd generate a new column with just the middle values and generate random numbers until you find one that's not in the column.
from random import randint
df["excl"] = df.number.apply(lambda x: int(x.split("-")[1]))
num = randint(10000, 99999)
while num in df.excl.values:
    num = randint(10000, 99999)
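With 15K+ rows, re-scanning `df.excl.values` on every draw gets slow; keeping the used middle values in a set makes each membership check O(1). A minimal sketch of that idea (the helper name is mine, not from the answer):

```python
from random import randint

def unique_middle(existing):
    # Draw until we hit a 5-digit number not seen before; the set gives
    # O(1) membership checks, then we record the new value in it.
    num = randint(10000, 99999)
    while num in existing:
        num = randint(10000, 99999)
    existing.add(num)
    return num

used = {12345, 82045, 93035}
new = unique_middle(used)
```

Each call returns a fresh middle value and updates `used`, so repeated calls never collide.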
I tried to come up with a generic approach; you can use this for lists:
import random
number_series = ["1000-12345-100", "1000-82045-200", "1000-93035-200"]
def rnd_nums(n_numbers: int, number_series: list, max_length: int = 5, prefix: int = 1000, suffix: int = 100):
    # ignore numbers that are already taken
    blacklist = [int(x.split('-')[1]) for x in number_series]
    # define the space of allowed numbers
    rng = range(0, 10**max_length)
    # draw a unique sample of length "n_numbers"
    lst = random.sample([i for i in rng if i not in blacklist], n_numbers)
    # return the sample as strings with pre- and suffix
    return ['{}-{:05d}-{}'.format(prefix, mid, suffix) for mid in lst]
rnd_nums(5, number_series)
Out[69]:
['1000-79396-100',
'1000-30032-100',
'1000-09188-100',
'1000-18726-100',
'1000-12139-100']
Or use it to generate new rows in a DataFrame:
import pandas as pd
data = {"number": ["1000-12345-100", "1000-82045-200", "1000-93035-200"]}
df = pd.DataFrame(data)
print(df)
pd.concat([df, pd.DataFrame({'number': rnd_nums(5, number_series)})], ignore_index=True)
(DataFrame.append was removed in pandas 2.0, so pd.concat is used here.)
Out[72]:
number
0 1000-12345-100
1 1000-82045-200
2 1000-93035-200
3 1000-00439-100
4 1000-36284-100
5 1000-64592-100
6 1000-50471-100
7 1000-02005-100
In addition to the other suggestions, you could also write a function that takes your df and the number of new values to add as arguments, appends the new numbers, and returns the updated df. The function could look like this:
import pandas as pd
import random
def add_number(df, num):
    lst = []
    for n in df["number"]:
        lst.append(int(n.split("-")[1]))
    for i in range(num):
        check = False
        while check == False:
            new_number = random.randint(10000, 99999)
            if new_number not in lst:
                lst.append(new_number)
                l = len(df["number"])
                # append at position l (the next free index), not l+1,
                # which would leave a gap in the index
                df.at[l, "number"] = "1000-%i-200" % new_number
                check = True
    df = df.reset_index(drop=True)
    return df
This would have the advantage that you could use the function every time you want to add new numbers.
Try:
import random
df['number'] = [f"1000-{x}-200" for x in random.sample(range(10000, 100000), len(df))]
(The upper bound of range is exclusive, so 100000 is needed to allow 99999.)
output:
number
0 1000-24744-200
1 1000-28991-200
2 1000-98322-200
...
One option is to use sample from the random module:
import random
num_digits = 5
col_length = 15000
rand_nums = random.sample(range(10**num_digits),col_length)
df["number"] = ['-'.join(['1000', str(num).zfill(num_digits), '200'])
                for num in rand_nums]
(str.join takes a single iterable, so the three pieces are passed as a list.)
It took my computer about 30 ms to generate the numbers. For numbers with more digits, it may become infeasible.
Another option is to just take sequential integers, then encrypt them. This will result in a sequence in which each element is unique. They will be pseudo-random, rather than truly random, but then Python's random module is producing pseudo-random numbers as well.
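As a minimal sketch of that encryption idea (the constants here are my choice, not from the answer): multiplying sequential integers by a constant coprime to the modulus is a bijection on the range, so every value appears exactly once.

```python
M = 100_000   # size of the 5-digit space
K = 48271     # coprime to 100000, so i -> (i*K) % M is a bijection

def encode(i):
    # Map sequential i to a unique, scrambled value in [0, M).
    return (i * K) % M

codes = [encode(i) for i in range(1000)]
strings = [f"1000-{c:05d}-200" for c in codes[:3]]
```

Because the mapping is a permutation of range(M), no uniqueness check is needed at all; the trade-off is that the sequence is fixed once K is chosen.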
Related
I have a large list of around 200 values
The list looks like this
list_ids = [10148,
10149,
10150,
10151,
10152,
10153,
10154,
10155,
10156,
10157,
10158,
10159,
10160,
10161,
10163,
10164,
10165,
10167,
10168,
10169,
10170,
10171,
10172,
10173,
10174,
10175,
10177,
10178,
10179,
10180,
10181,
10182,
10183,
7137,
7138,
7139,
7142,
7143,
7148,
7150,
7151,
7152,
7153,
7155,
7156,
7157,
9086,
9087,
9088,
9089,
9090,
9091,
9094,
9095,
9096,
9097,
2164]
I would like to shuffle this list and create a sublist of 19 values for each sublist.
I tried :
list_ids.sort(key=lambda list_ids, r={b: random.random() for a, b in list_ids}: r[list_ids[1]])
But it didn't work. It looks like I am missing something.
End result is a sublist with shuffled values containing 19 values each
You can shuffle the list with random.shuffle:
import random

# shuffle the list in place
random.shuffle(list_ids)
# split into sublists containing 19 elements each
splits = [list_ids[i:i+19] for i in range(0, len(list_ids), 19)]
import random
s = 19
random.shuffle(list_ids)
sub_lists = [list_ids[s*i:s*(i+1)] for i in range(len(list_ids) // s)]
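Note that `len(list_ids) // s` silently drops any leftover elements when the length is not a multiple of 19; a ceil-based variant (a small sketch with stand-in data) keeps the final short chunk:

```python
import math
import random

s = 19
items = list(range(60))  # stand-in for list_ids; 60 is not a multiple of 19
random.shuffle(items)
# math.ceil ensures a final, shorter chunk is emitted for the remainder
chunks = [items[s*i:s*(i+1)] for i in range(math.ceil(len(items) / s))]
```

Here `chunks` has three sublists of 19 and one of 3, so no element is lost.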
Convert to a pandas Series and draw a sample of size 19 (note this yields one random sublist, not a full partition):
import pandas as pd
ids = pd.Series(list_ids)
ids.sample(19).values
For random numbers between 0 and 1:
import random
random.shuffle(list_ids)
result = {}
for i in list_ids:
    result[i] = [random.random() for x in range(19)]
result
For random numbers drawn from the original list:
import random
random.shuffle(list_ids)
result = {}
for i in list_ids:
    result[i] = random.sample(list_ids, 19)
result
(random.sample replaces the pandas-based ids.sample from the snippet above, so this version is self-contained.)
I want to construct a list of 100 randomly generated share prices by generating 100 random 4-letter names for the companies and a corresponding random share price for each.
So far, I have written the following code which provides a random 4-letter company name:
import string
import random
def stock_generator():
    return ''.join(random.choices(string.ascii_uppercase, k=4))

stock_generator()
# OUTPUT
'FIQG'
But I want to generate 100 of these with accompanying random share prices. Is it possible to do this while keeping the list the same once it's created (i.e. using a seed of some sort)?
I think this approach will work for your task.
import string
import random
random.seed(0)

def stock_generator():
    return (''.join(random.choices(string.ascii_uppercase, k=4)), random.random())

parse_result = []
n = 100
for i in range(n):
    parse_result.append(stock_generator())
print(parse_result)
import string
import random
random.seed(0)
def stock_generator(n=100):
    return [(''.join(random.choices(string.ascii_uppercase, k=4)), random.random()) for _ in range(n)]
stocks = stock_generator()
print(stocks)
You can generate as many random stocks as you want with this list comprehension. stock_generator() returns a list of tuples of random 4-letter names and a random number between 0 and 1. I imagine a real price would look different, but this is how you'd start.
random.seed() lets you replicate the random generation.
Edit: average stock price as additionally requested
average_price = sum(stock[1] for stock in stocks) / len(stocks)
stocks[i][1] can be used to access the price in the name/price tuple.
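A tiny check of that replication property (a sketch, with names of my choosing):

```python
import random

def draw(seed, n=3):
    # Re-seeding before drawing makes the pseudo-random sequence reproducible.
    random.seed(seed)
    return [random.random() for _ in range(n)]

first = draw(42)
second = draw(42)
```

Calling draw with the same seed always yields the identical list, which is exactly why seeding keeps the generated stock list stable between runs.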
You can generate n consistent random samples by re-seeding random before each shuffle. Here is an example of how to generate such a list of names (10 samples):
import random, copy
sorted_list = ['A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z']
i = 0  # counter
n = 10  # number of samples
l = 4  # length of samples
myset = set()
while i < n:
    shuffled_list = copy.deepcopy(sorted_list)
    random.seed(i)
    random.shuffle(shuffled_list)
    name = tuple(shuffled_list[:l])
    if name not in myset:
        myset.add(name)
        i += 1
print(sorted([''.join(list(x)) for x in myset]))
# ['CKEM', 'DURP', 'GQXO', 'JFWI', 'JNRX', 'MNSV', 'OAXS', 'TIFX', 'VLZS', 'XYLK']
Then you can randomly generate n number of prices and create a list of tuples that binds each name to a price:
names = sorted([''.join(list(x)) for x in myset])
int_list = random.sample(range(1, 100), n)
prices = [x/10 for x in int_list]
names_and_prices = []
for name, price in zip(names, prices):
    names_and_prices.append((name, price))
# [('CKEM', 1.5), ('DURP', 1.7), ('GQXO', 6.5), ('JFWI', 7.6), ('JNRX', 0.9), ('MNSV', 8.9), ('OAXS', 5.0), ('TIFX', 9.6), ('VLZS', 1.4), ('XYLK', 3.8)]
Try:
import string
import random
def stock_generator(num):
    names = []
    for n in range(num):
        x = ''.join(random.choices(string.ascii_uppercase, k=4))
        names.append(x)
    return names

print(stock_generator(100))
Each time you call stock_generator with a number of your choice as parameter, you'll generate the stock names you need.
I am trying to round numbers in a dataframe whose values are lists. I need whole numbers to have no decimal and floats to have only two places after the decimal. Each list has an unknown number of values (some lists have 2 values, some have 4 or 5 or more). Here is what I have:
import pandas as pd
from decimal import Decimal

TWOPLACES = Decimal(10) ** -2
df = pd.DataFrame({"A": [[16.0, 24.4175], [14.9687, 16.06], [22.75, 23.00]]})

def remove_exponent(num):
    return num.to_integral() if num == num.to_integral() else num.normalize()

def round_string_float(x):
    try:
        return remove_exponent(Decimal(x).quantize(TWOPLACES))
    except:
        return x

df['A'] = df['A'].apply(lambda x: [round_string_float(num) for num in x])
But this gives me: [Decimal('16'), Decimal('24.42')]
Here is what I am trying:
def round(num):
    if str(numbers).find('/') > -1:
        nom, den = numbers.split(',')
        number = round_string_float(nom)
        second = round_string_float(den)
        return f'[{number}, {second}]'
but there has to be an easier way to do this
Here is what I want:
df = pd.DataFrame({"A": [[16, 24.42], [14.97, 16.06], [22.75, 23]]})
I would like to know how to use *args to do this, but really anything that works would be good.
Have you tried a for loop? For example:
rounded = []
for i in range(len(df)):
    for j in range(len(df["A"][i])):
        rounded.append(round(df["A"][i][j], 2))
(df[i] would look up a column label, so the nested lists are reached through the "A" column here.)
That's a weird format for a DataFrame, but if you want it you can do something like this:
import pandas as pd
df = pd.DataFrame({"A": [[16.0, 24.4175], [14.9687, 16.06], [22.75, 23.00]]})
print(df.applymap(lambda x: [round(v, None if v.is_integer() else 2) for v in x]))
Given that
The return value [of round] is an integer if ndigits is omitted or None.
this evaluates, for each nested number v, round(v) if v is an integer else round(v, 2).
This outputs
A
0 [16, 24.42]
1 [14.97, 16.06]
2 [22.75, 23]
I created an answer to this question that goes beyond what I originally wanted, but I think it will help anyone looking for something similar. The problem at my company is that we have to upload lists as values in a dataframe to the database, which is why the code is so ad hoc:
from decimal import *
TWOPLACES = Decimal(10) ** -2
from natsort import natsorted
import ast
import re
from fractions import Fraction
#----------------------------------------------------------------
# remove_exponent and round_string_float round whole numbers (16.00 -> 16)
# and trim numbers with 3 or more decimals to 2 decimals (16.254 -> 16.25)
def remove_exponent(num):
    return num.to_integral() if num == num.to_integral() else num.normalize()

def round_string_float(x):
    try:
        return remove_exponent(Decimal(x).quantize(TWOPLACES))
    except:
        return x
#------------------------------------------------------------------------------
# frac2string converts fractions to decimals: 1 1/2 -> 1.5
def frac2string(s):
    i, f = s.groups(0)
    f = round_string_float(Fraction(f))
    return str(int(i) + round_string_float(float(f)))
#------------------------------------------
# remove_duplicates is self-explanatory
def remove_duplicates(A):
    # drop repeated values while preserving order (popping inside an
    # enumerate loop, as originally written, skips elements)
    return list(dict.fromkeys(A))
# converts fractions and rounds numbers
df['matches'] = df['matches'].apply(lambda x:[re.sub(r'(?:(\d+)[-\s])?(\d+/\d+)', frac2string, x)])
# removes duplicates( this needs to be in the format ["\d","\d"]
df['matches'] = df['matches'].apply(lambda x: remove_duplicates([n.strip() for n in ast.literal_eval(x)]))
The below code takes 2 seconds to finish.
The code looks clean but is very inefficient.
I am trying to pre-generate the ways you can build up to a total of max_units in increments of 2.
I'd then filter the created table to rows where the secondary_categories meet certain criteria:
'A' is >10% of the total and 'B' is <=50% of the total.
Do you see a better way to get the combinations in increments of 2 that meet criteria like the above?
import itertools
import pandas as pd

primary_types = ['I', 'II']
secondary_categories = ['A', 'B']
unitcategories = len(primary_types) * len(secondary_categories)  # up to 8
min_units = 108; max_units = 110  # between 20 and 400
max_of_one_type = max_units
args = [[i for i in range(2, max_of_one_type, 2)] for x in range(unitcategories)]
lista = list(itertools.product(*args))
filt = [max_units >= l >= min_units for l in map(sum, lista)]
lista = list(itertools.compress(lista, filt))
df = pd.DataFrame(lista, columns=pd.MultiIndex.from_product([primary_types, secondary_categories], names=['', '']))
df['Total'] = df.sum(axis=1)
df
Extending the following makes it take significantly longer or run out of memory: primary_types, secondary_categories, min_units, max_units.
Thank you
OK so I'm posting this just FYI but I don't think it's an ideal solution. I believe there exists a far more elegant solution and I bet it involves numpy. However, this should at least be faster than the OP:
import itertools
import pandas as pd
primary_types = ["I", "II"]
secondary_categories = ["A", "B"]
unitcategories = len(primary_types) * len(secondary_categories) # up to 8
min_units = 54
max_units = 55 # between 10 and 200
max_of_one_type = max_units
args = [range(1, max_of_one_type) for x in range(unitcategories)]
lista = [x for x in itertools.product(*args) if max_units >= sum(x) >= min_units]
df = pd.DataFrame(
lista,
columns=pd.MultiIndex.from_product(
[primary_types, secondary_categories], names=["", ""]
),
)
df["Total"] = df.sum(axis=1)
df = df * 2 # multiply by 2 to get the result you want
I divided everything by 2 at the start and multiplied the result by 2 at the end.
I removed all unnecessary uses of list.
I removed itertools.compress and filt and instead put an if directly in the list comprehension (where lista is declared and assigned).
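A sketch of a product-free alternative (the function name and structure are mine): fix the total first, then enumerate only the compositions that reach it, so no out-of-range tuple is ever materialized. Assuming 4 categories and even increments:

```python
def compositions(total, k, step=2):
    # Yield all k-tuples of multiples of `step` (each >= step) summing to total.
    if k == 1:
        if total >= step and total % step == 0:
            yield (total,)
        return
    # Leave at least step for each of the remaining k-1 slots.
    for first in range(step, total - step * (k - 1) + 1, step):
        for rest in compositions(total - first, k - 1, step):
            yield (first,) + rest

# Totals 108 and 110, split across 4 categories in steps of 2.
rows = [c for t in range(108, 111, 2) for c in compositions(t, 4)]
```

Because every yielded tuple already satisfies the sum constraint, the filtering step disappears entirely, and memory grows only with the number of valid rows.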
I need to split the dataframe into 10 parts, then use one part as the test set and merge the remaining 9 to use as the training set. I have come up with the following code, where I am able to split the dataset, and I'm trying to merge the remaining sets after picking one of the 10.
The first iteration goes fine, but I get the following error in the second iteration.
df = pd.DataFrame(np.random.randn(10, 4), index=list(xrange(10)))
for x in range(3):
    dfList = np.array_split(df, 3)
    testdf = dfList[x]
    dfList.remove(dfList[x])
    print testdf
    traindf = pd.concat(dfList)
    print traindf
    print "================================================"
I don't think you have to split the dataframe in 10 but just in 2.
I use this code for splitting a dataframe in training set and validation set:
test_index = np.random.choice(df.index, int(len(df.index)/10), replace=False)
test_df = df.loc[test_index]
train_df = df.loc[~df.index.isin(test_index)]
okay I got it working this way :
df = pd.DataFrame(np.random.randn(10, 4), index=list(xrange(10)))
dfList = np.array_split(df, 3)
for x in range(3):
    trainList = []
    for y in range(3):
        if y == x:
            testdf = dfList[y]
        else:
            trainList.append(dfList[y])
    traindf = pd.concat(trainList)
    print testdf
    print traindf
    print "================================================"
But better approach is welcome.
You can use the permutation function from numpy.random
import numpy as np
import pandas as pd
import math as mt
l = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
df = pd.DataFrame({'a': l, 'b': l})
Shuffle the dataframe index:
shuffled_idx = np.random.permutation(df.index)
Divide the shuffled index into N equal(ish) parts; for this example, let N = 4:
N = 4
n = len(shuffled_idx) / N
parts = []
for j in range(N):
    parts.append(shuffled_idx[mt.ceil(j*n): mt.ceil(j*n+n)])

# to show each shuffled part of the data frame
for k in parts:
    print(df.iloc[k])
I wrote a small script (find/fork it on GitHub) for splitting a Pandas dataframe randomly; see also the Pandas documentation on merge, join, and concatenate.
Same code for your reference:
import pandas as pd
import numpy as np
from xlwings import Sheet, Range, Workbook
#path to file
df = pd.read_excel(r"//PATH TO FILE//")
df.columns = [c.replace(' ',"_") for c in df.columns]
x = df.columns[0].encode("utf-8")
#number of parts the data frame or the list needs to be split into
n = 7
seq = list(df[x])
np.random.shuffle(seq)
lists1 = [seq[i:i+n] for i in range(0, len(seq), n)]
listsdf = pd.DataFrame(lists1).reset_index()
dataframesDict = dict()
# calling xlwings workbook function
Workbook()
for i in range(0, n):
    if Sheet.count() < n:
        Sheet.add()
    dataframesDict[i] = df.loc[df.Column_Name.isin(list(listsdf[listsdf.columns[i+1]]))]
    Range(i, "A1").value = dataframesDict[i]
Looks like you are trying to do a k-fold type thing, rather than a one-off. This code should help. You may also find the SKLearn k-fold functionality works in your case, that's also worth checking out.
# Split dataframe by rows into n roughly equal portions and return list of
# them.
def splitDf(df, n):
    splitPoints = list(map(lambda x: int(x*len(df)/n), list(range(1, n))))
    splits = list(np.split(df.sample(frac=1), splitPoints))
    return splits

# Take splits from splitDf, and return test set (splits[index]) and training set (the rest)
def makeTrainAndTest(splits, index):
    # index is zero based, so range 0-9 for a 10-fold split
    test = splits[index]
    leftLst = splits[:index]
    rightLst = splits[index+1:]
    train = pd.concat(leftLst + rightLst)
    return train, test
You can then use these functions to make the folds
df = <my_total_data>
n = 10
splits = splitDf(df, n)
trainTest = []
for i in range(0, n):
    trainTest.append(makeTrainAndTest(splits, i))

# Get test set 2
test2 = trainTest[2][1]
# Get training set zero
train0 = trainTest[0][0]
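For comparison, the same k-fold idea can be sketched with numpy alone (the function name and structure are mine); sklearn's KFold offers the same behavior with more options:

```python
import numpy as np
import pandas as pd

def kfold_indices(n_rows, n_folds, seed=0):
    # Shuffle the row positions once, then cut them into n_folds
    # near-equal chunks; each chunk serves as one test fold.
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_rows), n_folds)

df = pd.DataFrame(np.random.randn(10, 4))
folds = kfold_indices(len(df), 5)
for test_idx in folds:
    test = df.iloc[test_idx]
    train = df.drop(df.index[test_idx])
```

Every row lands in exactly one test fold, and train/test are disjoint on each iteration, which is the property the loop above relies on.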