How to automate the process of selecting clusters using labels - python

So I'm new to Python and I'm working on the analysis of some data. I'm using an extremely manual process to find the clusters. First I get the labels using this method from the library:
labels = optics_model.labels_[optics_model.ordering_]
then I use np.argwhere to find the index values that have that label:
cluster_0 = np.argwhere(labels == 0)
then I print this data, use another site to clean it, and use it to select from the dataframe the rows that belong to this cluster:
index_0 = [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18]
cluster_0 = df.iloc[index_0]
can someone help me automate this process?

So after some looking and testing I made it work for me using a method to add a column with the labels to the dataframe:
df_copy = df.assign(labels=labels)
then I calculated the highest cluster label (avoiding the name max, which shadows the built-in):
max_label = 0
for i in range(len(labels)):
    if max_label < labels[i]:
        max_label = labels[i]
then I made the necessary number of empty dataframes (range(max_label + 1), so the highest label is included):
cluster = {}
for i in range(max_label + 1):
    cluster[i] = pd.DataFrame()
then I just copy the data I want from the dataframe:
for i in range(max_label + 1):
    cluster[i] = df_copy.loc[df_copy['labels'] == i]
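The whole manual procedure above can be collapsed into a single groupby. Here is a minimal sketch with a hypothetical df and labels standing in for the question's data; note that OPTICS marks noise points with the label -1, which the loop over non-negative labels would silently skip anyway:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the question's df and OPTICS labels.
df = pd.DataFrame({'x': [1.0, 2.0, 10.0, 11.0, 50.0]})
labels = np.array([0, 0, 1, 1, -1])  # OPTICS labels noise points as -1

df_labeled = df.assign(labels=labels)
# One sub-DataFrame per cluster label, skipping noise.
cluster = {label: group.drop(columns='labels')
           for label, group in df_labeled.groupby('labels')
           if label != -1}
```

This avoids counting labels by hand: groupby discovers every distinct label on its own, however many clusters OPTICS found.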


I want to multiply two columns in different dataframes

So I have a dataframe named data1 with a column named 'E-E11' in it and another dataframe named Volx with a column 'EVOL' in it. I want to multiply them but it doesn't work: I get a KeyError: 'E-E11'. All of the columns contain 332924 values.
I used this:
Volx = pd.read_csv('BCCdir1VOL.csv') #already floats in dataframe
Volx.drop(Volx.columns[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]], inplace=True, axis=1) # have one column in my data frame
data1 = pd.read_csv('abaqusBCC1Dir.csv') #already floats in dataframe
data1.drop(data1.columns[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15]], inplace=True, axis=1) # have one column in my data frame
def getPower(data1, Multiplicationx, numOfCol):
    for i in range(numOfCol):
        Volx = 'EVOL' % (i+1)
        E11x = 'E-E11' % (i+1)
        Multiplicationx = 'E11x_V' % (i+1)
        data1[Multiplicationx] = data1[E11x]*Volx[Volx]
        data1[Multiplicationx] = data1['E-E11']*Volx['EVOL']
instead of getting a new column Multiplicationx from multiplying columns of the two data frames, I get a KeyError: 'E-E11'. Please help me?
It's kind of hard to tell what's going on, but I don't understand 'EVOL' % (i+1): the % operator needs a format placeholder in the string. If the columns are numbered, build the names with f-strings, and keep them in separate variables so you don't shadow the Volx dataframe:
Try:
Volx_col = f'EVOL{i+1}'
E11x = f'E-E11{i+1}'
Multiplicationx = f'E11x_V{i+1}'
data1[Multiplicationx] = data1[E11x] * Volx[Volx_col]
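A runnable sketch of the core multiplication, with hypothetical single-column frames standing in for data1 and Volx. One thing to watch: pandas aligns two Series on their index when multiplying, so if the two frames were filtered differently beforehand, either reset the indices or multiply by the raw values:

```python
import pandas as pd

# Hypothetical single-column frames standing in for data1 and Volx.
data1 = pd.DataFrame({'E-E11': [1.0, 2.0, 3.0]})
Volx = pd.DataFrame({'EVOL': [10.0, 10.0, 10.0]})

# .to_numpy() sidesteps index alignment between the two frames,
# multiplying strictly position by position.
data1['E11x_V'] = data1['E-E11'] * Volx['EVOL'].to_numpy()
```

If both frames share the same clean RangeIndex, the plain `data1['E-E11'] * Volx['EVOL']` form works identically.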

How do I get index of a specific value (in second dataframe) based on the same value in first dataframe

I have 2 dataframes, df_ts and df_cmexport. I am trying to get the indices of the placement ids in df_cmexport for the placements in df_ts.
Once I have the indices of those placement ids as a list, I will iterate through them using for j in list_pe_ts_1: to get some value at index 'j', such as df_cmexport['p_start_year'][j].
My code below returns an empty list for some reason; print(list_pe_ts_1) returns [].
I think something is wrong with list_pe_ts_1 = df_cmexport.index[df_cmexport['Placement ID'] == pid_1].tolist(), as this returns an empty list of length 0.
I even tried using list_pe_ts_1 = df_cmexport.loc[df_cmexport.isin([pid_1]).any(axis=1)].index but it still gives an empty list.
Help is always appreciated :) Cheers to you all #stackoverflow
for i in range(0, len(df_ts)):
    pid_1 = df_ts['PLACEMENT ID'][i]
    print('for pid ', pid_1)
    list_pe_ts_1 = df_cmexport.index[df_cmexport['Placement ID'] == pid_1].tolist()
    print('len of list', len(list_pe_ts_1))
    ts_p_start_year_for_pid = df_ts['p_start_year'][i]
    ts_p_start_month_for_pid = df_ts['p_start_month'][i]
    ts_p_start_day_for_pid = df_ts['p_start_date'][i]
    print('\np_start_full_date_ts for :', pid_1, 'y:', ts_p_start_year_for_pid, 'm:', ts_p_start_month_for_pid,
          'd:', ts_p_start_day_for_pid)
    print(list_pe_ts_1)
    for j in list_pe_ts_1:
        export_p_start_year_for_pid = df_cmexport['p_start_year'][j]
        export_p_start_month_for_pid = df_cmexport['p_start_month'][j]
        export_p_start_day_for_pid = df_cmexport['p_start_date'][j]
        print('\np_start_full_date_export for ', pid_1, "at row(", j, ") :", export_p_start_year_for_pid,
              export_p_start_month_for_pid, export_p_start_day_for_pid)
        if (ts_p_start_year_for_pid == export_p_start_year_for_pid) and (
                ts_p_start_month_for_pid == export_p_start_month_for_pid) and (
                ts_p_start_day_for_pid == export_p_start_day_for_pid):
            pids_p_1.add(pid_1)
        else:
            pids_f_1.add(pid_1)
With the snippet below you can get a list of the matching index values from the second dataframe.
import pandas as pd
df_ts = pd.DataFrame(data={'index in df': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
                           'pid': [1, 1, 2, 2, 3, 3, 3, 4, 6, 8, 8, 9, 9]})
df_cmexport = pd.DataFrame(data={'index in df': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
                                 'pid': [1, 1, 1, 2, 3, 3, 3, 3, 3, 4, 4, 4, 5, 5, 6, 7, 8, 8, 9, 9, 9]})
Create a new dataframe by merging the two:
result = pd.merge(df_ts, df_cmexport, left_on=["pid"], right_on=["pid"], how='left', indicator='True', sort=True)
Then identify the unique values in the "index in df_y" column (the suffix _y marks columns that came from df_cmexport):
index_list = result["index in df_y"].unique()
The result you get:
index_list
Out[9]:
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 14, 16, 17, 18, 19,
20], dtype=int64)
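The merge idea can be condensed into a few lines. A small self-contained sketch (toy pid values, not the real export data) that turns df_cmexport's row labels into a column so they survive the merge:

```python
import pandas as pd

# Toy data: pids in the first frame to look up in the second.
df_ts = pd.DataFrame({'pid': [1, 2, 3]})
df_cmexport = pd.DataFrame({'pid': [1, 1, 3, 4]})

# reset_index() turns df_cmexport's row labels into an 'index' column,
# so the inner merge keeps only rows whose pid appears in df_ts.
matches = df_cmexport.reset_index().merge(df_ts, on='pid')
index_list = sorted(matches['index'].tolist())
```

Note this also sidesteps the likeliest cause of the empty list in the question: an exact-name mismatch ('PLACEMENT ID' vs. 'Placement ID') or a dtype mismatch (string vs. integer pids) between the two frames.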

How to convert csv to multiple arrays without pandas?

I have a csv file like this:
student_id,event_id,score
1,1,20
3,1,20
4,1,18
5,1,13
6,1,18
7,1,14
8,1,14
9,1,11
10,1,19
...
and I need to convert it into multiple arrays/lists like I did using pandas here:
scores = pd.read_csv("/content/score.csv", encoding='utf-8', index_col=[])
student_id = scores['student_id'].values
event_id = scores['event_id'].values
score = scores['score'].values
print(scores.head())
As you can see, I get three arrays, which I need in order to run the data analysis. How can I do this using Python's csv library? I have to do this without the use of pandas. Also, how can I export data from multiple new arrays into a csv file when I am done with this data? I, again, used pandas to do this:
avg = avgScore
max = maxScore
min = minScore
sum = sumScore
id = student_id_data
dict = {'avg(score)': avg, 'max(score)': max, 'min(score)': min, 'sum(score)': sum, 'student_id': id}
df = pd.DataFrame(dict)
df.to_csv(r'/content/AnalyzedData.csv', index=False)
Those first 5 are arrays if you are wondering.
Here's a partial answer which will produce a separate list for each column in the CSV file.
import csv

csv_filepath = "score.csv"
with open(csv_filepath, "r", newline='') as csv_file:
    reader = csv.DictReader(csv_file)
    columns = reader.fieldnames
    lists = {column: [] for column in columns}  # Lists for each column.
    for row in reader:
        for column in columns:
            lists[column].append(int(row[column]))

for column_name, column in lists.items():
    print(f'{column_name}: {column}')
Sample output:
student_id: [1, 3, 4, 5, 6, 7, 8, 9, 10]
event_id: [1, 1, 1, 1, 1, 1, 1, 1, 1]
score: [20, 20, 18, 13, 18, 14, 14, 11, 19]
You also asked how to do the reverse of this. Here's an example I hope is self-explanatory:
# Dummy sample analysis data
length = len(lists['student_id'])
avgScore = list(range(length))
maxScore = list(range(length))
minScore = list(range(length))
sumScore = list(range(length))
student_ids = lists['student_id']

csv_output_filepath = 'analysis.csv'
fieldnames = ('avg(score)', 'max(score)', 'min(score)', 'sum(score)', 'student_id')
with open(csv_output_filepath, 'w', newline='') as csv_file:
    writer = csv.DictWriter(csv_file, fieldnames)
    writer.writeheader()
    for values in zip(avgScore, maxScore, minScore, sumScore, student_ids):
        row = dict(zip(fieldnames, values))  # Combine into dictionary.
        writer.writerow(row)
What you want to do does not require the csv module; it's just three lines of code (one of them admittedly dense):
splitted_lines = (line.strip().split(',') for line in open('/path/to/your/data.csv'))
labels = next(splitted_lines)
arr = dict(zip(labels, zip(*((int(i) for i in ii) for ii in splitted_lines))))
splitted_lines is a generator that iterates over your data file one line at a time and yields a list with the three (in your example) items on each line.
next(splitted_lines) returns the list with the (split) content of the first line, that is, our three labels.
We fit our data into a dictionary; using the class init method (i.e., by invoking dict) it is possible to initialize it from a generator of 2-tuples, here the value of a zip:
zip's 1st argument is labels, so the keys of the dictionary will be the labels of the columns;
the 2nd argument is the result of evaluating an inner zip, used here because zipping the starred form of a sequence of sequences has the effect of transposing it, so the value associated with each key will be one column of the data;
what follows the * is simply (the generator equivalent of) a list of lists with (in your example) 9 rows of three integer values;
the second argument of the outer zip is consequently a sequence of three sequences of nine integers, which are paired with the corresponding three keys/labels.
Here is an example of using the data collected by the previous three lines of code:
In [119]: print("\n".join("%15s:%s"%(l,','.join("%3d"%i for i in arr[l])) for l in labels))
...:
student_id: 1, 3, 4, 5, 6, 7, 8, 9, 10
event_id: 1, 1, 1, 1, 1, 1, 1, 1, 1
score: 20, 20, 18, 13, 18, 14, 14, 11, 19
In [120]: print(*arr['score'])
20 20 18 13 18 14 14 11 19
PS If the question is about an assignment in a sort of Python 101, it's unlikely that my solution would be deemed acceptable.
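The three-line approach can be exercised end-to-end on an in-memory sample; io.StringIO stands in here for the open file, everything else is unchanged:

```python
import io

# In-memory sample standing in for /path/to/your/data.csv.
data = "student_id,event_id,score\n1,1,20\n3,1,20\n4,1,18\n"

splitted_lines = (line.strip().split(',') for line in io.StringIO(data))
labels = next(splitted_lines)
# The outer zip(*...) transposes the rows into columns,
# so each label maps to a tuple holding its whole column.
arr = dict(zip(labels, zip(*((int(i) for i in ii) for ii in splitted_lines))))
```

Each value in arr is a tuple; wrap it in list() if you need a mutable column.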

Is there a way to find the nᵗʰ entry in itertools.combinations() without converting the entire thing to a list?

I am using the itertools module in Python.
I am interested in the different ways to choose 15 of the first 26000 positive integers. The function itertools.combinations(range(1,26000), 15) enumerates all of these possible subsets, in a lexicographic ordering.
The binomial coefficient 26000 choose 15 is a very large number, on the order of 10^54. However, Python has no problem running the code y = itertools.combinations(range(1,26000), 15), because combinations returns a lazy iterator.
If I try to do y[3] to find just the 3rd entry, I get a TypeError, since a combinations object is not subscriptable; it seems I would need to convert it into a list first. The problem is that trying to convert it into a list gives a MemoryError.
Converting it into a list does work for smaller combinations, like 6 choose 3.
My question is:
Is there a way to access specific elements in itertools.combinations() without converting it into a list?
I want to be able to access, say, the first 10000 of these ~10^54 enumerated 15-element subsets.
Any help is appreciated. Thank you!
You can use a generator expression:
comb = itertools.combinations(range(1,26000), 15)
comb1000 = (next(comb) for i in range(1000))
To jump directly to the nth combination, here is an itertools recipe:
def nth_combination(iterable, r, index):
    """Equivalent to list(combinations(iterable, r))[index]"""
    pool = tuple(iterable)
    n = len(pool)
    if r < 0 or r > n:
        raise ValueError
    c = 1
    k = min(r, n-r)
    for i in range(1, k+1):
        c = c * (n - k + i) // i
    if index < 0:
        index += c
    if index < 0 or index >= c:
        raise IndexError
    result = []
    while r:
        c, n, r = c*r//n, n-1, r-1
        while index >= c:
            index -= c
            c, n = c*(n-r)//n, n-1
        result.append(pool[-1-n])
    return tuple(result)
It's also available in more_itertools.nth_combination
>>> import more_itertools # pip install more-itertools
>>> more_itertools.nth_combination(range(1,26000), 15, 123456)
(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 18, 19541)
To instantly "fast-forward" a combinations instance to this position and continue iterating, you can set the state to the previously yielded state (note: 0-based state vector) and continue from there:
>>> comb = itertools.combinations(range(1,26000), 15)
>>> comb.__setstate__((0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 17, 19540))
>>> next(comb)
(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 18, 19542)
If you want to access the first few elements, it's pretty straightforward with islice:
import itertools
print(list(itertools.islice(itertools.combinations(range(1,26000), 15), 1000)))
Note that islice internally iterates the combinations up to the specified point, so it can't magically give you the middle elements without iterating all the way there. You'd have to go down the route of computing the elements you want combinatorially in that case.
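For modest indices, any direct-jump recipe can be cross-checked against plain iteration. The helper below (name is mine, purely illustrative) uses islice to skip to the nth entry without materializing a list, which is exactly the linear-time baseline the combinatorial recipe avoids:

```python
import itertools

def nth_by_iteration(iterable, r, index):
    # Linear-time reference: skip 'index' combinations, take the next one.
    return next(itertools.islice(itertools.combinations(iterable, r), index, None))

# 6 choose 3, 0-based index 4: (1,2,3), (1,2,4), (1,2,5), (1,2,6), (1,3,4), ...
result = nth_by_iteration(range(1, 7), 3, 4)
```

For 26000 choose 15 this is only practical for small indices; anything deep into the ~10^54 sequence needs the combinatorial unranking approach above.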

Generate two different random number lists [duplicate]

This question already has answers here:
Python random number excluding one variable
(6 answers)
Closed 3 years ago.
I want to generate two different random number lists.
The condition is that the value at each index of the first list cannot be equal to the value at the same index of the second list.
For example, with a=[5,6,7,5] and q=[2,7,3,5], the value at the fourth index of list q is equal to the value at the same index in list a. I want to avoid this. I created list a as follows:
import random
a = []
b = list(range(1, 7164))
for i in b:
    t = random.randint(1, 20)
    a.append(t)
How do I generate the second list with the above condition?
import random

def generate_n_lists(num_of_lists, num_of_elements, value_from=0, value_to=100):
    s = random.sample(range(value_from, value_to + 1), num_of_lists * num_of_elements)
    return [s[i*num_of_elements:(i+1)*num_of_elements] for i in range(num_of_lists)]

print(generate_n_lists(2, 5, 0, 20))  # generate 2 lists, each 5 elements, values from 0 to 20
Prints:
[[1, 16, 4, 3, 15], [0, 10, 14, 17, 7]]
This creates a and q as tuples, but you can easily convert them to lists.
In [29]: import random
In [30]: size = 15
In [31]: maxval = 20
In [32]: a, q = zip(*[random.sample(range(1, maxval+1), 2) for z in range(size)])
In [33]: a
Out[33]: (18, 7, 12, 6, 17, 16, 12, 1, 14, 20, 9, 5, 8, 5, 18)
In [34]: q
Out[34]: (12, 10, 6, 1, 12, 15, 20, 7, 6, 10, 5, 7, 16, 7, 10)
The best approach, I think, would be to iterate over each item and offset it by a random number in a way that it can't be the same as the original value.
Add the following to the end of your code:
c = []
for i in range(len(a)):
    t = (a[i] - 1 + random.randint(1, 19)) % 20 + 1
    c.append(t)
This way you offset each item by a number between 1 and 19 and wrap around above 20 (the -1/+1 shifts the 1-20 range onto 0-19 for the modulus), so c[i] can never equal a[i].
To avoid it, just check if it is repeating. If it is, generate a different random number again.
import random
a = []
b = list(range(1, 7164))
for i in b:
    t = random.randint(1, 20)
    while t == i:
        t = random.randint(1, 20)
    a.append(t)
print(a)
import random
a = []
b = []
for rand_a in range(7163):
    a.append(random.randint(1, 20))
random.seed()
for rand_b in range(7163):
    r = random.randint(1, 20)
    # keep rolling until you get a diff number
    while a[rand_b] == r:
        r = random.randint(1, 20)
    b.append(r)
Your code example had one random list and one list built from range(1, 7164).
This code generates two lists of values from 1 to 20, each with 7163 elements, where every element differs from the value at the same position in the other list.
The call to seed probably isn't needed.
There are multiple ways to do this.
One would be to generate the second random value in a smaller range, and offset it if it equals or exceeds the excluded value:
excluded_value = first_list[i]
new_value = random.randint(1, 19)
if new_value >= excluded_value:
    new_value += 1
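A runnable sketch of this offset trick under the question's 1-20 range (the function name and list sizes are mine, for illustration):

```python
import random

def random_excluding(excluded, low=1, high=20):
    # Draw from one fewer candidate, then skip over the excluded value:
    # results land uniformly in [low, high] minus {excluded}.
    value = random.randint(low, high - 1)
    if value >= excluded:
        value += 1
    return value

a = [random.randint(1, 20) for _ in range(1000)]
q = [random_excluding(x) for x in a]
```

Unlike rejection sampling, this needs exactly one draw per element, and the result stays uniformly distributed over the 19 allowed values.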
Another is to generate the lists at the same time, using random.sample to select two distinct values without replacement:
possible_values = range(1, 21)  # 1 to 20 inclusive; or xrange on python 2.x
while i < desired_num_values:
    a, b = random.sample(possible_values, 2)
    first_list.append(a)
    second_list.append(b)
    i += 1
I have not profiled to see if there's a notable performance difference. Both seem likely to be faster than repeatedly generating a random number until there isn't a conflict (but again, I haven't profiled to confirm). The second scales more gracefully if you want more than two lists.
These are not the only ways to do this.
