I have a large list of around 200 values.
The list looks like this:
list_ids = [10148,
10149,
10150,
10151,
10152,
10153,
10154,
10155,
10156,
10157,
10158,
10159,
10160,
10161,
10163,
10164,
10165,
10167,
10168,
10169,
10170,
10171,
10172,
10173,
10174,
10175,
10177,
10178,
10179,
10180,
10181,
10182,
10183,
7137,
7138,
7139,
7142,
7143,
7148,
7150,
7151,
7152,
7153,
7155,
7156,
7157,
9086,
9087,
9088,
9089,
9090,
9091,
9094,
9095,
9096,
9097,
2164]
I would like to shuffle this list and split it into sublists of 19 values each.
I tried:
list_ids.sort(key=lambda list_ids, r={b: random.random() for a, b in list_ids}: r[list_ids[1]])
But it didn't work. It looks like I am missing something.
The end result should be a set of sublists of shuffled values, each containing 19 values.
You can shuffle the list with random.shuffle:
import random
# shuffles the list in place
random.shuffle(list_ids)
# split into lists containing 19 elements each (the last one may be shorter)
splits = [list_ids[i:i+19] for i in range(0, len(list_ids), 19)]
import random
s = 19
random.shuffle(list_ids)
sub_lists = [list_ids[s*i:s*(i+1)] for i in range(len(list_ids) // s)]
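The two slicing approaches above differ when the list length is not a multiple of 19: stepping `range` by the chunk size keeps a shorter final chunk, while `len(list_ids) // s` drops the leftover values. A minimal sketch of the first behaviour, which also leaves the input list untouched. The helper name `shuffled_chunks`, its `seed` parameter, and the stand-in data are assumptions made for illustration:

```python
import random

def shuffled_chunks(values, size, seed=None):
    """Return a shuffled copy of `values` split into chunks of `size`.

    The input list is not modified; the last chunk may be shorter.
    """
    rng = random.Random(seed)                    # local generator, global state untouched
    shuffled = rng.sample(values, len(values))   # shuffled copy of the input
    return [shuffled[i:i + size] for i in range(0, len(shuffled), size)]

example_ids = list(range(40))                    # stand-in data, not the original list
chunks = shuffled_chunks(example_ids, 19, seed=0)
print([len(c) for c in chunks])                  # [19, 19, 2]
```

If every sublist must have exactly 19 values, the final short chunk can simply be discarded.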
Convert to pandas series and get a sample of size 19:
import pandas as pd
ids = pd.Series(list_ids)
ids.sample(19).values
For random numbers between 0 and 1:
import random
random.shuffle(list_ids)
result = {}
for i in list_ids:
    result[i] = [random.random() for x in range(19)]
result
For random numbers from the original list:
import random
random.shuffle(list_ids)
result = {}
for i in list_ids:
    # ids is the pandas Series created from list_ids above
    result[i] = [ids.sample(19).values]
result
I currently have a list of numbers. How would you go about combining similar numbers (within 850 of each other) and replacing them with their average, to make the list smaller?
For example, I have the list
l = [2000,2200,5000,2350]
In this list, I want to find the numbers that are similar, i.e. within n+500 of each other.
Those are 2000, 2200 and 2350, so they should be added up and divided by their count, which is 3, to find the mean. The mean then replaces the three numbers, so the list becomes l = [2183,5000].
For my actual list, I would like all the numbers that are close by n+850 to be selected and their mean to be found.
It seems that you are looking for a clustering algorithm, something like K-means.
This algorithm is implemented in the scikit-learn package.
After you find your K means, you can count how many of your data points were clustered with each mean and make your computations.
However, it's not clear what K is in your case. You can try running the algorithm for several values of K until your constraint is met (the n+500 distance between the means).
You can use:
import numpy as np
l = np.array([2000,2200,5000,2350])
# find similar numbers (that are within each 500 fold)
similar = l // 500
# for each similar group get the average and convert it to integer (as in the desired output)
new_list = [np.average(l[similar == num]).astype(int) for num in np.unique(similar)]
print(new_list)
Output:
[2183, 5000]
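Note that floor division assigns values to fixed 500-wide bins, so two close values such as 2499 and 2501 land in different groups. If "similar" is instead read as "within the threshold of the next smaller value", one possible alternative is to sort the values and start a new group wherever the gap exceeds the threshold. This is only a sketch under that assumed reading; `merge_close` is a hypothetical name:

```python
def merge_close(values, threshold):
    """Average runs of sorted values whose consecutive gaps are <= threshold."""
    groups = []
    for v in sorted(values):
        if groups and v - groups[-1][-1] <= threshold:
            groups[-1].append(v)      # close enough to the previous value
        else:
            groups.append([v])        # gap too large: start a new group
    return [sum(g) / len(g) for g in groups]

merged = merge_close([2000, 2200, 5000, 2350], 500)
print(merged)
```

Because groups are built by chaining neighbours, the first and last value of one group can be further apart than the threshold. Whether that is acceptable depends on what "similar" should mean for the data.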
Step 1:
list = [5620.77978515625,
7388.43017578125,
7683.580078125,
8296.6513671875,
8320.82421875,
8557.51953125,
8743.5,
9163.220703125,
9804.7939453125,
9913.86328125,
9940.1396484375,
9951.74609375,
10074.23828125,
10947.0419921875,
11048.662109375,
11704.099609375,
11958.5,
11964.8232421875,
12335.70703125,
13103.0,
13129.529296875,
16463.177734375,
16930.900390625,
17712.400390625,
18353.400390625,
19390.96484375,
20089.0,
34592.15625,
36542.109375,
39478.953125,
40782.078125,
41295.26953125,
42541.6796875,
42893.58203125,
44578.27734375,
45077.578125,
48022.2890625,
52535.13671875,
58330.5703125,
61597.91796875,
62757.12890625,
64242.79296875,
64863.09765625,
66930.390625]
Step 2:
seen = []  # to log used index pairs
diff_dic = {}  # to record indices and diff
for i, a in enumerate(list):
    for j, b in enumerate(list):
        if i != j and (i, j)[::-1] not in seen:
            seen.append((i, j))
            diff_dic[(i, j)] = abs(a - b)
keys = []
for ind, diff in diff_dic.items():
    if diff <= 850:
        keys.append(ind)
uniques_k = []  # to record unique indices
for pair in keys:
    for key in pair:
        if key not in uniques_k:
            uniques_k.append(key)
import numpy as np
list_arr = np.array(list)
nearest_avg = np.mean(list_arr[uniques_k])
list_arr = np.delete(list_arr, uniques_k)
list_arr = np.append(list_arr, nearest_avg)
list_arr
output:
array([ 5620.77978516, 34592.15625, 36542.109375, 39478.953125, 48022.2890625, 52535.13671875, 58330.5703125 , 61597.91796875, 62757.12890625, 66930.390625 , 20566.00205365])
You just need a conditional list comprehension like this:
l = [2000,2200,5000,2350]
n = 2000
a = [ (x) for x in l if ((n -250) < x < (n + 250)) ]
Then you can average with
np.mean(a)
or whatever method you prefer.
I want to construct a list of 100 randomly generated share prices by generating 100 random 4-letter names for the companies and a corresponding random share price for each.
So far, I have written the following code which provides a random 4-letter company name:
import string
import random
def stock_generator():
    return ''.join(random.choices(string.ascii_uppercase, k=4))
stock_generator()
# OUTPUT
'FIQG'
However, I want to generate 100 of these with accompanying random share prices. Is it possible to do this while keeping the list the same once it's created (i.e. using a seed of some sort)?
I think this approach will work for your task.
import string
import random
random.seed(0)
def stock_generator():
    return (''.join(random.choices(string.ascii_uppercase, k=4)), random.random())
parse_result = []
n = 100
for i in range(0, n):
    parse_result.append(stock_generator())
print(parse_result)
import string
import random
random.seed(0)
def stock_generator(n=100):
    return [(''.join(random.choices(string.ascii_uppercase, k=4)), random.random()) for _ in range(n)]
stocks = stock_generator()
print(stocks)
You can generate as many random stocks as you want with this list comprehension. stock_generator() returns a list of tuples, each holding a random 4-letter name and a random number between 0 and 1. I imagine a real price would look different, but this is how you'd start.
random.seed() lets you replicate the random generation.
Edit: average stock price as additionally requested
average_price = sum(stock[1] for stock in stocks) / len(stocks)
stocks[i][1] can be used to access the price in the name/price tuple.
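To make the reproducibility point concrete, here is a small sketch that uses a dedicated `random.Random` instance, so seeding does not affect other code that uses the global generator. The function name `make_stocks` and the price range of 1 to 500 are assumptions chosen only for illustration:

```python
import random
import string

def make_stocks(n=100, seed=0):
    """Return n (name, price) tuples; the same seed gives the same list."""
    rng = random.Random(seed)                    # private, seeded generator
    stocks = []
    for _ in range(n):
        name = ''.join(rng.choices(string.ascii_uppercase, k=4))
        price = round(rng.uniform(1, 500), 2)    # assumed price range
        stocks.append((name, price))
    return stocks

first = make_stocks()
second = make_stocks()
print(first == second)   # True: identical seed, identical list
```

Names drawn this way are not guaranteed to be unique; if uniqueness matters, duplicates would have to be rejected and redrawn.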
You can generate a consistent set of n random samples by setting the seed of random before each shuffle. Here is an example of how to generate such a list of names (10 samples):
import random, copy
sorted_list = ['A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z']
i = 0 #counter
n = 10 # number of samples
l = 4 # length of samples
myset = set()
while i < n:
    shuffled_list = copy.deepcopy(sorted_list)
    random.seed(i)
    random.shuffle(shuffled_list)
    name = tuple(shuffled_list[:l])
    if name not in myset:
        myset.add(name)
    # advance the seed on every pass so a repeated name cannot loop forever
    i += 1
print(sorted([''.join(list(x)) for x in myset]))
# ['CKEM', 'DURP', 'GQXO', 'JFWI', 'JNRX', 'MNSV', 'OAXS', 'TIFX', 'VLZS', 'XYLK']
Then you can randomly generate n number of prices and create a list of tuples that binds each name to a price:
names = sorted([''.join(list(x)) for x in myset])
int_list = random.sample(range(1, 100), n)
prices = [x/10 for x in int_list]
names_and_prices = []
for name, price in zip(names, prices):
    names_and_prices.append((name, price))
# [('CKEM', 1.5), ('DURP', 1.7), ('GQXO', 6.5), ('JFWI', 7.6), ('JNRX', 0.9), ('MNSV', 8.9), ('OAXS', 5.0), ('TIFX', 9.6), ('VLZS', 1.4), ('XYLK', 3.8)]
Try:
import string
import random
def stock_generator(num):
    names = []
    for n in range(num):
        x = ''.join(random.choices(string.ascii_uppercase, k=4))
        names.append(x)
    return names
print(stock_generator(100))
Each time you call the function stock_generator with a number of your choice as the parameter, you'll generate that many stock names.
I have a dataframe containing 15K+ strings in the format of xxxx-yyyyy-zzz. The yyyyy is a random 5 digit number generated. Given that I have xxxx as 1000 and zzz as 200, how can I generate the random yyyyy and add it to the dataframe so that the string is unique?
number
0 1000-12345-100
1 1000-82045-200
2 1000-93035-200
import pandas as pd
data = {"number": ["1000-12345-100", "1000-82045-200", "1000-93035-200"]}
df = pd.DataFrame(data)
print(df)
I'd generate a new column with just the middle values and generate random numbers until you find one that's not in the column.
from random import randint
df["excl"] = df.number.apply(lambda x:int(x.split("-")[1]))
num = randint(10000, 99999)
while num in df.excl.values:
    num = randint(10000, 99999)
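The same retry idea can be extended to draw several new numbers, as long as each accepted number is recorded so it cannot be drawn again. The sketch below keeps the used values in a set for fast membership checks; the helper name `next_unique` and the `existing` sample list are illustrative assumptions, not part of the original answer:

```python
import random

existing = ["1000-12345-100", "1000-82045-200", "1000-93035-200"]
used = {int(s.split("-")[1]) for s in existing}   # middle parts already taken

def next_unique(used, rng=random):
    """Draw five-digit numbers until one is unused, then record and return it."""
    while True:
        num = rng.randint(10000, 99999)
        if num not in used:
            used.add(num)
            return num

new_ids = [f"1000-{next_unique(used)}-200" for _ in range(5)]
print(new_ids)
```

Retrying works well while most of the five-digit range is still free; as the range fills up, the number of retries grows.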
I tried to come up with a generic approach, you can use this for lists:
import random
number_series = ["1000-12345-100", "1000-82045-200", "1000-93035-200"]
def rnd_nums(n_numbers: int, number_series: list, max_length: int=5, prefix: int=1000, suffix: int=100):
    # ignore following numbers
    blacklist = [int(x.split('-')[1]) for x in number_series]
    # define space with allowed numbers
    rng = range(0, 10**max_length)
    # get unique sample of length "n_numbers"
    lst = random.sample([i for i in rng if i not in blacklist], n_numbers)
    # return sample as string with pre- and suffix
    return ['{}-{:05d}-{}'.format(prefix, mid, suffix) for mid in lst]
rnd_nums(5, number_series)
Out[69]:
['1000-79396-100',
'1000-30032-100',
'1000-09188-100',
'1000-18726-100',
'1000-12139-100']
Or use it to generate new rows in a DataFrame:
import pandas as pd
data = {"number": ["1000-12345-100", "1000-82045-200", "1000-93035-200"]}
df = pd.DataFrame(data)
print(df)
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
pd.concat([df, pd.DataFrame({'number': rnd_nums(5, number_series)})], ignore_index=True)
Out[72]:
number
0 1000-12345-100
1 1000-82045-200
2 1000-93035-200
3 1000-00439-100
4 1000-36284-100
5 1000-64592-100
6 1000-50471-100
7 1000-02005-100
In addition to the other suggestions, you could also write a function that takes your df and the amount of new numbers you would like to add as arguments, appends it with the new numbers and returns the updated df. The function could look like this:
import pandas as pd
import random
def add_number(df, num):
    lst = []
    for n in df["number"]:
        n = n.split("-")[1]
        lst.append(int(n))
    for i in range(num):
        check = False
        while check == False:
            new_number = random.randint(10000, 99999)
            if new_number not in lst:
                lst.append(new_number)
                l = len(df["number"])
                df.at[l+1,"number"] = "1000-%i-200" % new_number
                check = True
    df = df.reset_index(drop=True)
    return df
This would have the advantage that you could use the function every time you want to add new numbers.
Try:
import random
df['number'] = [f"1000-{x}-200" for x in random.sample(range(10000, 99999), len(df))]
output:
number
0 1000-24744-200
1 1000-28991-200
2 1000-98322-200
...
One option is to use sample from the random module:
import random
num_digits = 5
col_length = 15000
rand_nums = random.sample(range(10**num_digits),col_length)
# str.join takes a single iterable, so the three parts go in a list
data["number"] = ['-'.join(
    ['1000', str(num).zfill(num_digits), '200'])
    for num in rand_nums]
It took my computer about 30 ms to generate the numbers. For numbers with more digits, it may become infeasible.
Another option is to just take sequential integers, then encrypt them. This will result in a sequence in which each element is unique. They will be pseudo-random, rather than truly random, but then Python's random module is producing pseudo-random numbers as well.
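The "sequential integers, then scramble" idea can be illustrated without a real cipher. The sketch below uses an affine map modulo 10^5, which is a one-to-one mapping whenever the multiplier shares no factor with the modulus. This is not encryption and the output is easy to predict; it only shows why the results stay unique. The constants `MULTIPLIER` and `OFFSET` are arbitrary assumptions:

```python
from math import gcd

MODULUS = 10**5        # five-digit space: 00000..99999
MULTIPLIER = 48271     # assumed constant; must be coprime with MODULUS
OFFSET = 12345         # assumed constant; any integer works

# The map is a bijection on 0..MODULUS-1 only if this holds
assert gcd(MULTIPLIER, MODULUS) == 1

def scramble(i):
    """Map a sequential integer to a unique value in 0..MODULUS-1."""
    return (MULTIPLIER * i + OFFSET) % MODULUS

codes = [f"1000-{scramble(i):05d}-200" for i in range(15000)]
print(len(set(codes)) == len(codes))   # True: no duplicates
```

For values that should be hard to guess, a proper format-preserving cipher would be needed instead of this simple map.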
I have two sets of data which I would like to multiply element by element, storing each result in an array.
For now I have this:
import csv
from mpdaf.obj import Spectrum, WaveCoord
import matplotlib.pyplot as plt
import pandas as pd
from csv import reader
file_path = input("Enter full transmission curve path : ")
# 'rw' is not a valid mode in Python 3; the file is only read here
with open(file_path, 'r') as f:
    data = list(reader(f, delimiter=","))
wavelength = [i[0] for i in data]
percentage = [float(str(i[1]).replace(',','.')) for i in data]
spectrum = input("Full spectrum path : ")
spe = Spectrum(filename=spectrum, ext=0)
data_flux = spe.data
flux_array = []
for i in percentage:
    for j in data_flux:
        flux = i*j
        flux_array.append(flux)
print(flux_array)
As written, it takes the first i, multiplies it by every j, then takes the next i, and so on.
I would like to multiply just the first i by the first j and store the value in the array, then multiply the second i by the second j and store that value, and so on.
It is as the error message says: your indices i and j are floats, not integers. When you write for i in percentage:, i takes on every value in the percentage list. Instead, you might want to iterate through a range. Here's an example to illustrate the difference:
percentage = [50.0, 60.0, 70.0]
for i in percentage:
    print(i)
# 50.0
# 60.0
# 70.0
for i in range(len(percentage)):
    print(i)
# 0
# 1
# 2
To iterate through a list of indices, you probably want to iterate through a range:
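for i in range(len(percentage)):
    for j in range(len(data_flux)):
        flux = percentage[i]*data_flux[j]
        flux_array.append(flux)
This iterates over the integer indices of each list, starting at 0 and ending at the last index of the list. Note that the nested loops still visit every (i, j) pair; for an element-by-element product, use the same index for both lists.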
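For the element-by-element product the question asks for (first with first, second with second), one common approach is to walk both sequences together with `zip`. A minimal sketch with stand-in values, since the real `percentage` and `data_flux` come from files:

```python
percentage = [50.0, 60.0, 70.0]   # stand-in values
data_flux = [1.0, 2.0, 3.0]       # stand-in values, same length as percentage

# zip pairs the first items, then the second items, and so on
flux_array = [p * f for p, f in zip(percentage, data_flux)]
print(flux_array)   # [50.0, 120.0, 210.0]
```

`zip` stops at the shorter input, so it is worth checking that both sequences have the same length. If both are NumPy arrays of equal length, `percentage * data_flux` gives the same element-wise result.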
cust_id = semi_final_df['0_x'].tolist()
date = semi_final_df[1].tolist()
total_amount = semi_final_df[0].tolist()
prod_num = semi_final_df['0_y'].tolist()
prod_deduped = []
quant_cleaned = []
product_net_amount = []
cust_id_final = []
date_final = []
for row in total_amount:
    quant_cleaned.append(float(row))
for unique_prodz in prod_num:
    if unique_prodz not in prod_deduped:
        prod_deduped.append(unique_prodz)
for unique_product in prod_deduped:
    indices = [i for i, x in enumerate(prod_num) if x == unique_product]
    product_total = 0
    for index in indices:
        product_total += quant_cleaned[index]
    product_net_amount.append(product_total)
    first_index = prod_num.index(unique_product)
    cust_id_final.append(cust_id[first_index])
    date_final.append(date[first_index])
Above code calculates sum amount by one condition in order to sum the total on an invoice.
The data had multiple lines but shared the same invoice/product number.
Problem:
I need to modify the code below so that I can sum by unique product and unique date.
I have given it a go, but I am getting a ValueError
saying x, y is not in list.
As per my understanding the issue lies in the fact that I am zipping two de-duped lists together of different lengths and then I am attempting to loop through the result inline.
This line causes the error
for i,[x, y] in enumerate(zipped_list):
Any help would be sincerely appreciated. Here is the second batch of code with comments.
from itertools import zip_longest
#I have not included the code for the three lists below but you can assume they are populated as these are the lists that I will be
#working off of. They are of the same length.
prod_numbers = []
datesz = []
converted_quant = []
#Code to dedupe date and product which will end up being different lengths. These two lists are populated by the two for loops below
prod_deduped = []
dates_deduped = []
for unique_prodz in prod_numbers:
    if unique_prodz not in prod_deduped:
        prod_deduped.append(unique_prodz)
for unique_date in datesz:
    if unique_date not in dates_deduped:
        dates_deduped.append(unique_date)
#Now for the fun part. Time to sum by date and product. The three lists below are empty until we run the code
converted_net_amount = []
prod_id_final = []
date_final = []
#I zipped the list together using itertools which I imported at the top
for unique_product, unique_date in zip_longest(prod_deduped, dates_deduped, fillvalue = ''):
    indices = []
    zipped_object = zip(prod_numbers, datesz)
    zipped_list = list(zipped_object)
    for i,[x, y] in enumerate(zipped_list):
        if x == unique_product and y == unique_date:
            indices.append(i)
    converted_total = 0
    for index in indices:
        converted_total += converted_quant[index]
    converted_net_amount.append(converted_total)
    first_index = zipped_list.index([unique_product, unique_date])
    prod_id_final.append(prod_numbers[first_index])
    date_final.append(datesz[first_index])
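from collections import defaultdict
summed_dictionary = defaultdict(int)
# walk the three equal-length lists from the question together
for x, y, z in zip(prod_numbers, datesz, converted_quant):
    summed_dictionary[(x,y)] += z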
Using defaultdict should solve your problem and is a lot easier on the eyes than all your code above. I saw this on reddit this morning and figured you crossposted. Credit to the guy from reddit on /r/learnpython
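As a runnable illustration of the defaultdict approach, the sketch below sums amounts per (product, date) pair. The sample values are made up for the example; only the list names are taken from the question:

```python
from collections import defaultdict

# Stand-in data; the three lists are assumed to have the same length
prod_numbers = ["A1", "A1", "B2", "A1"]
datesz = ["2020-01-01", "2020-01-01", "2020-01-01", "2020-01-02"]
converted_quant = [10.0, 5.0, 7.5, 2.0]

totals = defaultdict(float)   # missing keys start at 0.0
for prod, date, amount in zip(prod_numbers, datesz, converted_quant):
    totals[(prod, date)] += amount

for (prod, date), total in totals.items():
    print(prod, date, total)
```

Each key is a (product, date) tuple, so rows are grouped only when both values match, and no separate de-duplication or index lookup is needed.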