How to calculate the number of occurrences between data in Excel? - python
I have a huge CSV table with thousands of rows of data. I want to make a table of the number of times two elements occur together, divided by how many times that element appears.
For example, Bitcoin appears 8 times in these rows, and 2 of those times it appears together with API. The relation between Bitcoin and API is that API always appears alongside Bitcoin, so the value for API appearing with Bitcoin is 1, and the value for Bitcoin appearing with API is 1/4.
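In other words, for two skills A and B, the value I am after is count(rows containing both A and B) divided by count(rows containing A): with the numbers above that gives 2/8 = 1/4 for Bitcoin appearing with API, and 1 for API appearing with Bitcoin, since API never appears in a row without Bitcoin.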
I want something that looks like this in the end.
How can I do it with Python or any other tool?
This is a sample of the file:
This, I think, does do the job. I typed your spreadsheet into a csv by hand (would have been nice to be able to cut and paste), and the results seem reasonable.
import itertools
import csv
import numpy as np

# Collect, for every word, the list of words it appears with (one entry per co-occurrence).
words = {}
for row in open('input.csv'):
    parts = row.rstrip().split(',')
    for a, b in itertools.combinations(parts, 2):
        if a not in words:
            words[a] = [b]
        else:
            words[a].append(b)
        if b not in words:
            words[b] = [a]
        else:
            words[b].append(a)
print(words)

# Build a square matrix of co-occurrence counts; the diagonal holds each word's total.
size = len(words)
keys = list(words.keys())
track = np.zeros((size, size))
for i, k in enumerate(keys):
    track[i, i] = len(words[k])
    for j in words[k]:
        track[i, keys.index(j)] += 1
        track[keys.index(j), i] += 1
print(keys)

# Scale to [0,1].
for row in range(track.shape[0]):
    track[row, :] /= track[row, row]

# Create a csv with the results.
fout = open('corresp.csv', 'w')
print(','.join([' '] + keys), file=fout)
for row in range(track.shape[0]):
    print(keys[row], file=fout, end=',')
    print(','.join(f"{track[row, i]}" for i in range(track.shape[1])), file=fout)
Here are the first few lines of the result:
,API,Backend Development,Bitcoin,Docker,Article Rewriting,Article writing,Blockchain,Content Writing,Ghostwriting,Android,Ethereum,PHP,React.js,C Programming,C++ Programming,ASIC,Digital ASIC Coding,Embedded Software,Article Writing,Blog,Copy Typing,Affiliate Marketing,Brand Marketing,Bulk Marketing,Sales,BlockChain,Business Strategy,Non-fungible Tokens,Technical Writing,.NET,Arduino,Software Architecture,Bluetooth Low Energy (BLE),C# Programming,Ada programming,Programming,Haskell,Rust,Algorithm,Java,Mathematics,Machine Learning (ML),Matlab and Mathematica,Data Entry,HTML,Circuit Designs,Embedded Systems,Electronics,Microcontroller, C++ Programming,Python
API,1.0,0.14285714285714285,0.5714285714285714,0.14285714285714285,0.0,0.0,0.2857142857142857,0.0,0.0,0.0,0.14285714285714285,0.0,0.14285714285714285,0.2857142857142857,0.2857142857142857,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Backend Development,0.6666666666666666,1.0,0.6666666666666666,0.6666666666666666,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Bitcoin,0.21052631578947367,0.05263157894736842,1.0,0.05263157894736842,0.0,0.0,0.2631578947368421,0.0,0.0,0.05263157894736842,0.10526315789473684,0.10526315789473684,0.05263157894736842,0.15789473684210525,0.21052631578947367,0.05263157894736842,0.05263157894736842,0.05263157894736842,0.0,0.0,0.0,0.05263157894736842,0.05263157894736842,0.05263157894736842,0.05263157894736842,0.05263157894736842,0.05263157894736842,0.05263157894736842,0.05263157894736842,0.0,0.0,0.05263157894736842,0.0,0.0,0.0,0.0,0.05263157894736842,0.05263157894736842,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Docker,0.6666666666666666,0.6666666666666666,0.6666666666666666,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
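For comparison, a similar table can be produced with pandas and collections.Counter. This is only a sketch, not the answer above rewritten exactly: it assumes input.csv holds one comma-separated list of skills per row (as in the code above), and it normalizes each pair count by the number of rows the first skill appears in, which is the ratio described in the question.

import itertools
from collections import Counter
import pandas as pd

rows = [line.rstrip().split(',') for line in open('input.csv')]

# Rows each skill appears in, and rows each ordered pair of skills shares.
appears = Counter(skill for row in rows for skill in set(row))
together = Counter(pair for row in rows for pair in itertools.permutations(set(row), 2))

# ratio.loc[a, b] = rows containing both a and b / rows containing a
skills = sorted(appears)
ratio = pd.DataFrame(0.0, index=skills, columns=skills)
for (a, b), n in together.items():
    ratio.loc[a, b] = n / appears[a]
for s in skills:
    ratio.loc[s, s] = 1.0

ratio.to_csv('corresp_pandas.csv')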
I had a look at this by creating a pivot table in Excel for every combination of columns there is: AB, AC, AD, BC, BD, CD. Putting the unique entries from the first column (e.g. A) in the rows and the unique entries from the second (e.g. B) in the columns, and then putting column A in the values area, I find all matches and the count of all matches.
This is a clunky method, but judging by the Python-based method that has been submitted, my answer is essentially no more or less clunky than that!
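For reference, the same family of per-pair pivot tables can be generated programmatically with pandas. The snippet below is only a sketch: the DataFrame and its column names A-D are made up to mirror the description above, and pd.crosstab plays the role of one pivot table per column pair.

import itertools
import pandas as pd

# Hypothetical four-column skill layout, mirroring the AB/AC/AD/BC/BD/CD pairs above.
df = pd.DataFrame({'A': ['Bitcoin', 'Bitcoin', 'API'],
                   'B': ['API', 'Blockchain', 'Docker'],
                   'C': ['Docker', 'Ethereum', 'PHP'],
                   'D': ['Backend Development', 'PHP', 'React.js']})

for left, right in itertools.combinations(df.columns, 2):
    print(f'--- {left} vs {right} ---')
    print(pd.crosstab(df[left], df[right]))   # counts of co-occurring entries for this pair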
Related
How to display a count of combinations from a data set in python [duplicate]
This question already has answers here: Count occurrences in DataFrame (2 answers). Closed 5 months ago.
I have a data set of customers and products and I would like to know which combinations of products are the most popular combinations chosen by customers, and to display that in a table (like a traditional mileage chart or another neat way). Example dataset: Example output: I am able to tell that the most popular combination of products for customers is P1 with P2 and the least popular is P1 with P3. My actual dataset is of course much larger in terms of customers and products. I'd also be keen to hear any ideas on better output visualisations too, especially as I can't figure out how to best display 3-way or 4-way popular combinations. Thank you
I have a full code example that may work for what you are doing... or at least give you some ideas on how to move forward. This script uses OpenPyXl to scrape the info from the first sheet. It is turned into a dictionary where the keys are strings of the combinations. The combinations are then counted and the result is placed into a second sheet (see image). Results: The Code:

from openpyxl import load_workbook
from collections import Counter

#Load master workbook/Worksheet and the file to be processed
data_wb = load_workbook(filename=r"C:\\Users\---Place your loc here---\SO_excel.xlsx")
data_ws = data_wb['Sheet1']
results_ws = data_wb['Sheet2']

#Finding Max rows in sheets
data_max_rows = data_ws.max_row
results_max_rows = results_ws.max_row

#Collecting Values and placing in array
customer_dict = {}
for row in data_ws.iter_rows(min_row = 2, max_col = 2, max_row = data_max_rows): #service_max_rows
    #Gathering row values and creating var's for relevant ones
    row_list = [cell.value for cell in row]
    customer_cell = row_list[0]
    product_cell = row_list[3]

    #Creating Str value for dict
    if customer_cell not in customer_dict:
        words = ""
        words += product_cell
        customer_dict.update({customer_cell:words})
    else:
        words += ("_" + product_cell)
        customer_dict.update({customer_cell:words})

#Counting Occurances in dict for keys
count_dict = Counter(customer_dict.values())

#Column Titles
results_ws.cell(1, 1).value = "Combonation"
results_ws.cell(1, 2).value = "Occurances"

#Placing values into spreadsheet
count = 2
for key, value in count_dict.items():
    results_ws.cell(count, 1).value = key
    results_ws.cell(count, 2).value = value
    count += 1

data_wb.save(filename = r"C:\\Users\---Place your loc here---\SO_excel.xlsx")
data_wb.close()
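If pandas is available, a more compact route is to group by customer and count sorted product pairs directly. This is only a sketch; the sheet name and the Customer/Product column names are assumptions about the workbook layout, not taken from the answer above.

import itertools
from collections import Counter
import pandas as pd

df = pd.read_excel('SO_excel.xlsx', sheet_name='Sheet1')   # assumed columns: Customer, Product
pair_counts = Counter()
for _, products in df.groupby('Customer')['Product']:
    # Each unordered pair of distinct products bought by this customer counts once.
    for pair in itertools.combinations(sorted(set(products)), 2):
        pair_counts[pair] += 1

print(pd.Series(pair_counts).sort_values(ascending=False))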
How to divide a pandas data frame into sublists of n at a time?
I have a data frame made of tweets and their authors; there are a total of 45 authors. I want to divide the data frame into groups of 2 authors at a time so that I can export them later into csv files. I tried using the following (given that the authors are in the column named 'B' and the tweets are in the column named 'A'). I took the following from this question:

df.set_index(keys=['B'], drop=False, inplace=True)
authors = df['B'].unique().tolist()

in order to separate the lists:

dgroups = []
for i in range(0, len(authors)-1, 2):
    dgroups.append(df.loc[df.B==authors[i]])
    dgroups.extend(df.loc[df.B==authors[i+1]])

but instead it gives me sub-lists like this:

dgroups = [['A'],['B'], [tweet,author], ['A'],['B'], [tweet,author2]]

Prior to this I was able to divide them correctly into 45 sub-lists, derived from the previous link 1, as follows:

for i in authors:
    groups.append(df.loc[df.B==i])

So how would I do that for 2 authors or 3 authors or so on?

EDIT: from #Jonathan Leon's answer, I thought I would do the following, which worked but isn't a dynamic solution and is inefficient I guess, especially if n>3:

dgroups = []
for i in range(2, len(authors)+1, 2):
    tempset1 = []
    tempset2 = []
    tempset1 = df.loc[df.B==authors[i-2]]
    if(i-1 != len(authors)):
        tempset2 = df.loc[df.B==authors[i-1]]
        dgroups.append(tempset1.append(tempset2))
    else:
        dgroups.append(tempset1)
This imports the foreign language incorrectly, but the logic works to create a new csv for every two authors.

import pandas as pd

df = pd.read_csv('TrainDataAuthorAttribution.csv')
# df.groupby('B').count()
authors = df.B.unique().tolist()
auths_in_subset = 2
for i in range(auths_in_subset, len(authors) + auths_in_subset, auths_in_subset):
    # print(authors[i-auths_in_subset:i])
    dft = df[df.B.isin(authors[i-auths_in_subset:i])]
    # print(dft)
    dft.to_csv('df' + str(i) + '.csv')
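Another way to express the same idea, sketched here with a made-up output filename pattern, is to map each author to a chunk number and let pandas do the grouping:

import pandas as pd

df = pd.read_csv('TrainDataAuthorAttribution.csv')
authors = df['B'].unique()
n = 2   # authors per output file
chunk_of = {author: i // n for i, author in enumerate(authors)}

# One CSV per group of n authors.
for chunk_id, sub in df.groupby(df['B'].map(chunk_of)):
    sub.to_csv(f'authors_chunk_{chunk_id}.csv', index=False)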
Pandas very slow query
I have the following code, which reads a csv file and then analyzes it. One patient has more than one illness and I need to find how many times an illness is seen across all patients. But the query given here

raw_data[(raw_data['Finding Labels'].str.contains(ctr)) & (raw_data['Patient ID'] == i)].size

is so slow that it takes more than 15 minutes. Is there a way to make the query faster?

raw_data = pd.read_csv(r'C:\Users\omer.kurular\Desktop\Data_Entry_2017.csv')

data = ["Cardiomegaly", "Emphysema", "Effusion", "No Finding", "Hernia", "Infiltration", "Mass", "Nodule", "Atelectasis", "Pneumothorax", "Pleural_Thickening", "Pneumonia", "Fibrosis", "Edema", "Consolidation"]

illnesses = pd.DataFrame({"Finding_Label":[], "Count_of_Patientes_Having":[], "Count_of_Times_Being_Shown_In_An_Image":[]})

ids = raw_data["Patient ID"].drop_duplicates()
index = 0

for ctr in data[:1]:
    illnesses.at[index, "Finding_Label"] = ctr
    illnesses.at[index, "Count_of_Times_Being_Shown_In_An_Image"] = raw_data[raw_data["Finding Labels"].str.contains(ctr)].size / 12
    for i in ids:
        illnesses.at[index, "Count_of_Patientes_Having"] = raw_data[(raw_data['Finding Labels'].str.contains(ctr)) & (raw_data['Patient ID'] == i)].size
    index = index + 1

Part of the dataframes:

Raw_data
Finding Labels - Patient ID
IllnessA|IllnessB - 1
Illness A - 2
From what I read I understand that ctr stands for the name of a disease. When you are doing this query:

raw_data[(raw_data['Finding Labels'].str.contains(ctr)) & (raw_data['Patient ID'] == i)].size

You are not only filtering the rows which have the disease, but also which have a specific patient id. If you have a lot of patients, you will need to do this query a lot of times. A simpler way to do it would be to not filter on the patient id and then take the count of all the rows which have the disease. This would be:

raw_data[raw_data['Finding Labels'].str.contains(ctr)].size

And in this case, since you want the number of rows, len is what you are looking for instead of size (size will be the number of cells in the dataframe).

Finally, another source of error in your current code was the fact that you were not keeping the count for every patient id. You needed to increment illnesses.at[index, "Count_of_Patientes_Having"], not set it to a new value each time. The code would be something like (for the last few lines), assuming you want to keep the disease name and the index separate:

for index, ctr in enumerate(data[:1]):
    illnesses.at[index, "Finding_Label"] = ctr
    illnesses.at[index, "Count_of_Times_Being_Shown_In_An_Image"] = len(raw_data[raw_data["Finding Labels"].str.contains(ctr)]) / 12
    illnesses.at[index, "Count_of_Patientes_Having"] = len(raw_data[raw_data['Finding Labels'].str.contains(ctr)])

I took the liberty of using enumerate for a more pythonic way of handling indexes. I also don't really know what "Count_of_Times_Being_Shown_In_An_Image" is, but I assumed you had had the same confusion between size and len.
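If "Count_of_Patientes_Having" is meant to be the number of distinct patients who have the illness (rather than the number of matching rows), a hedged variant of the same idea would reuse one boolean mask per illness and call nunique() on the patient column; this sketch assumes each row of the CSV corresponds to one image:

for index, ctr in enumerate(data):
    mask = raw_data['Finding Labels'].str.contains(ctr)
    illnesses.at[index, 'Finding_Label'] = ctr
    illnesses.at[index, 'Count_of_Times_Being_Shown_In_An_Image'] = mask.sum()
    illnesses.at[index, 'Count_of_Patientes_Having'] = raw_data.loc[mask, 'Patient ID'].nunique()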
Likely the reason your code is slow is that you are growing a data frame row-by-row inside a loop which can involve multiple in-memory copying. Usually this is reminiscent of general purpose Python and not Pandas programming which ideally handles data in blockwise, vectorized processing.

Consider a cross join of your data (assuming a reasonable data size) to the list of illnesses to line up Finding Labels to each illness in the same row, to be filtered if the longer string contains the shorter item. Then, run a couple of groupby() to return the count and distinct count by patient.

# CROSS JOIN LIST WITH MAIN DATA FRAME (ALL ROWS MATCHED)
raw_data = (raw_data.assign(key=1)
                    .merge(pd.DataFrame({'ills':ills, 'key':1}), on='key')
                    .drop(columns=['key'])
           )

# SUBSET BY ILLNESS CONTAINED IN LONGER STRING
raw_data = raw_data[raw_data.apply(lambda x: x['ills'] in x['Finding Labels'], axis=1)]

# CALCULATE GROUP BY count AND distinct count
def count_distinct(grp):
    return (grp.groupby('Patient ID').size()).size

illnesses = pd.DataFrame({'Count_of_Times_Being_Shown_In_An_Image': raw_data.groupby('ills').size(),
                          'Count_of_Patients_Having': raw_data.groupby('ills').apply(count_distinct)})

To demonstrate, consider below with random, seeded input data and output.

Input Data (attempting to mirror original data)

import numpy as np
import pandas as pd

alpha = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789'
data_tools = ['sas', 'stata', 'spss', 'python', 'r', 'julia']
ills = ["Cardiomegaly", "Emphysema", "Effusion", "No Finding", "Hernia",
        "Infiltration", "Mass", "Nodule", "Atelectasis", "Pneumothorax",
        "Pleural_Thickening", "Pneumonia", "Fibrosis", "Edema", "Consolidation"]

np.random.seed(542019)
raw_data = pd.DataFrame({'Patient ID': np.random.choice(data_tools, 25),
                         'Finding Labels': np.core.defchararray.add(
                              np.core.defchararray.add(np.array([''.join(np.random.choice(list(alpha), 3)) for _ in range(25)]),
                                                       np.random.choice(ills, 25).astype('str')),
                              np.array([''.join(np.random.choice(list(alpha), 3)) for _ in range(25)]))
                        })

print(raw_data.head(10))
#   Patient ID       Finding Labels
# 0          r   xPNPneumothoraxXYm
# 1     python   ScSInfiltration9Ud
# 2      stata   tJhInfiltrationJtG
# 3          r      thLPneumoniaWdr
# 4      stata    thYAtelectasis6iW
# 5        sas      2WLPneumonia1if
# 6      julia  OPEConsolidationKq0
# 7        sas   UFFCardiomegaly7wZ
# 8      stata         9NQHerniaMl4
# 9     python        NB8HerniapWK

Output (after running above process)

print(illnesses)
#                     Count_of_Times_Being_Shown_In_An_Image  Count_of_Patients_Having
# ills
# Atelectasis                                               3                         1
# Cardiomegaly                                              2                         1
# Consolidation                                             1                         1
# Effusion                                                  1                         1
# Emphysema                                                 1                         1
# Fibrosis                                                  2                         2
# Hernia                                                    4                         3
# Infiltration                                              2                         2
# Mass                                                      1                         1
# Nodule                                                    2                         2
# Pleural_Thickening                                        1                         1
# Pneumonia                                                 3                         3
# Pneumothorax                                              2                         2
How to make categories out of my text file and calculate average out of the numbers?
I am working on an assignment, but I am stuck and I do not know how to proceed. I need to make different categories out of the different categories from the first line (of the txt file) and calculate averages over every numerical value. The program has to work flawlessly when I add new lines to the txt file.

Category;currency;sellerRating;Duration;endDay;ClosePrice;OpenPrice;Competitive?
Music/Movie/Game;US;3249;5;Mon;0,01;0,01;No
Music/Movie/Game;US;3249;5;Mon;0,01;0,01;No
Music/Movie/Game;US;3249;5;Mon;0,01;0,01;No
Music/Movie/Game;US;3249;5;Mon;0,01;0,01;No
Music/Movie/Game;US;3249;5;Mon;0,01;0,01;No
Music/Automotive/Game;US;3249;5;Mon;0,01;0,01;No
Music/Automotive/Game;US;3249;5;Mon;0,01;0,01;No

This is the text file. I tried to make different categories out of them, but I do not know if I did it correctly and how to let Python know that it has to calculate all the numbers from one group.

with open('bijlage2.txt') as bestand:
    maak_er_lists_van = [(line.strip()).split(';') for line in bestand]
    keys = maak_er_lists_van[0]
    lijst = list(zip([keys]*len(maak_er_lists_van[1:]), maak_er_lists_van[1:]))
    x = [zip(i[0], i[1]) for i in lijst]
    maak_dict = [dict(i) for i in x]
    for i in maak_dict:
        categorieen = [i['Category'], i['currency'], i['sellerRating'], i['Duration'],
                       i['endDay'], i['ClosePrice'], i['OpenPrice'], i['Competitive?']]
        categorieen = list(map(int, categorieen))

This is what I have so far. I am a Python beginner, so the whole text file thing is new to me. Can somebody help me or explain what I have to do so that I can work further on this project? Many thanks in advance!
Here's how I would do it. I had to add using locale.atof() because where I am . is used as the decimal point, not commas. You may have to change this as indicated. The csv module is used to read the file, and the averages are computed in a two-step process. First the values for each category are summed, and then afterwards the average value of each one is calculated based on the number of values read.

import csv
import locale
from pprint import pprint

#locale.setlocale(locale.LC_ALL, '')  # empty string for platform's default settings
# Following used for testing to force ',' to be considered as a decimal point.
locale.setlocale(locale.LC_ALL, 'French_France.1252')

avg_names = 'sellerRating', 'Duration', 'ClosePrice', 'OpenPrice'
averages = {avg_name: 0 for avg_name in avg_names}  # Initialize.

# Find total of each category of interest.
num_values = 0
with open('bijlage2.txt', newline='') as bestand:
    csvreader = csv.DictReader(bestand, delimiter=';')
    for row in csvreader:
        num_values += 1
        for avg_name in avg_names:
            averages[avg_name] += locale.atof(row[avg_name])

# Calculate average of each summed value.
for avg_name, total in averages.items():
    averages[avg_name] = total / num_values

print('raw results:')
pprint(averages)
print()

# Formatted output
print('Averages:')
for avg_name in avg_names:
    rounded = locale.format_string('%.2f', round(averages[avg_name], 2), grouping=True)
    print('  {:<13} {:>10}'.format(avg_name, rounded))

Output:

raw results:
{'ClosePrice': 0.01, 'Duration': 5.0, 'OpenPrice': 0.01, 'sellerRating': 3249.0}

Averages:
  sellerRating    3 249,00
  Duration             5,00
  ClosePrice           0,01
  OpenPrice            0,01
Everything is fine with your way to read the file and creating a dictionary with the categories and values, imo. Your list maak_dict contains one dictionary for every line. To calculate an average for one category, you could do something like this:

def calc_average(categ):
    values = [i[categ] for i in maak_dict]
    average = sum(values)/len(values)
    return average

assuming that you want to calculate the mean average. categ has to be a string. After that, you can create a new dictionary that contains all the averages:

new_dict = {}
for category in maak_dict[0].keys():
    avg = calc_average(category)
    new_dict[category] = avg
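Since the assignment also asks for per-category results, a pandas sketch may be worth comparing against; it assumes the file really is ;-separated with , as the decimal mark, exactly as in the sample shown in the question.

import pandas as pd

df = pd.read_csv('bijlage2.txt', sep=';', decimal=',')
numeric_cols = ['sellerRating', 'Duration', 'ClosePrice', 'OpenPrice']

# Overall averages, then averages per Category.
print(df[numeric_cols].mean())
print(df.groupby('Category')[numeric_cols].mean())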
Finding the best reciprocal hit in a single BLAST file using python
I have a BLAST outfmt 6 output file in standard format. I want to find a way to loop through the file, select each hit, find its reciprocal hit and decipher which is the best hit to store. For example:

d = {}
for line in input_file:
    term = line.split('\t')
    qseqid = term[0]
    sseqid = term[1]
    hit = qseqid, sseqid
    recip_hit = sseqid, qseqid
    for line in input_file:
        if recip_hit in line:
            compare both lines
            done

Example input (tab delimited):

Seq1  Seq2  80  1000  10  3  1  1000  100  1100  0.0  500
Seq2  Seq1  95  1000  10  3  100  1100  1  1000  1e-100  500

Can anyone provide any insight into how to efficiently tackle this problem? Many thanks in advance
You could approach your problem to find those pairs and compare the lines like this:

#create a dictionary to store pairs
line_dict = {}

#iterate over your file
for line in open("test.txt", "r"):
    line = line[:-1].split("\t")
    #ignore line, if not at least one value apart from the two sequence IDs
    if len(line) < 3:
        continue
    #identify the two sequences
    seq = tuple(line[0:2])
    #is reverse sequence already in dictionary?
    if seq[::-1] in line_dict:
        #append new line
        line_dict[seq[::-1]].append(line)
    else:
        #create new entry
        line_dict[seq] = [line]

#remove entries, for which no counterpart exists
pairs = {k: v for k, v in line_dict.items() if len(v) > 1}

#and do things with these pairs
for pair, seq in pairs.items():
    print(pair, "found in:")
    for item in seq:
        print(item)

The advantage is that you only have to iterate once over your file, because you store all data and discard them only, if you haven't found a matching reversed pair. The disadvantage is that this takes space, so for very large files, this approach might not be feasible.

A similar approach - to store all data in your working memory - utilises pandas. This should be faster, since sorting algorithms are optimised for pandas. Another advantage of pandas is that all your other values are already in pandas columns - so further analysis is made easier. I definitely prefer the pandas version, but I don't know, if it is installed on your system. To make things easier to communicate, I assigned a and b to the columns that contain the sequences Seq1 and Seq2.

import pandas as pd

#read data into a dataframe
#not necessary: drop the header of the file, use custom columns names
df = pd.read_csv("test.txt", sep='\t', names=list("abcde"), header = 0)
#create a column that joins Seq1 - Seq2 or Seq2 - Seq1 to Seq1Seq2
df["pairs"] = df.apply(lambda row: ''.join(sorted([row["a"], row["b"]])), axis = 1)
#remove rows with no matching pair and sort the database
only_pairs = df[df["pairs"].duplicated(keep = False)].sort_values(by = "pairs")
print(only_pairs)
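To then keep only the best line of each reciprocal pair, one option, sketched here under the assumption that the file follows the standard 12-column outfmt 6 layout with the bit score last, is to take the row with the maximum bit score within each pair:

import pandas as pd

# Standard BLAST outfmt 6 column names; adjust if your file differs.
cols = ['qseqid', 'sseqid', 'pident', 'length', 'mismatch', 'gapopen',
        'qstart', 'qend', 'sstart', 'send', 'evalue', 'bitscore']
df = pd.read_csv('test.txt', sep='\t', names=cols)

# Unordered pair identifier, so Seq1/Seq2 and Seq2/Seq1 land in the same group.
df['pair'] = df.apply(lambda r: tuple(sorted([r['qseqid'], r['sseqid']])), axis=1)

# Keep only pairs that occur more than once (i.e. have a reciprocal line),
# then the best-scoring line within each pair.
recip = df[df.duplicated('pair', keep=False)]
best = recip.loc[recip.groupby('pair')['bitscore'].idxmax()]
print(best)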