I have a huge CSV table of thousands of data, I want to make a table of number of occurrence of two elements together divided by how many that element presented
[
Like Bitcoin appeared 8 times in this rows with 2 times with API so the relation between bitcoin to API: is that API always exists with bitcoin so the value of API appearing with bitcoin is 1 and bitcoin appearing with API is 1/4.
I want something looks like this in the end
How I can do it with python or any other tool?
This is sample of file
sample of the file
This, I think, does do the job. I typed your spreadsheet into a csv by hand (would have been nice to be able to cut and paste), and the results seem reasonable.
import itertools
import csv
import numpy as np
words = {}
for row in open('input.csv'):
parts = row.rstrip().split(',')
for a,b in itertools.combinations(parts,2):
if a not in words:
words[a] = [b]
else:
words[a].append( b )
if b not in words:
words[b] = [a]
else:
words[b].append( a )
print(words)
size = len(words)
keys = list(words.keys())
track = np.zeros((size,size))
for i,k in enumerate(keys):
track[i,i] = len(words[k])
for j in words[k]:
track[i,keys.index(j)] += 1
track[keys.index(j),i] += 1
print(keys)
# Scale to [0,1].
for row in range(track.shape[0]):
track[row,:] /= track[row,row]
# Create a csv with the results.
fout = open('corresp.csv','w')
print( ','.join([' ']+keys), file=fout )
for row in range(track.shape[0]):
print( keys[row], file=fout, end=',')
print( ','.join(f"{track[row,i]}" for i in range(track.shape[1])), file=fout )
Here's the first few lines of the result:
,API,Backend Development,Bitcoin,Docker,Article Rewriting,Article writing,Blockchain,Content Writing,Ghostwriting,Android,Ethereum,PHP,React.js,C Programming,C++ Programming,ASIC,Digital ASIC Coding,Embedded Software,Article Writing,Blog,Copy Typing,Affiliate Marketing,Brand Marketing,Bulk Marketing,Sales,BlockChain,Business Strategy,Non-fungible Tokens,Technical Writing,.NET,Arduino,Software Architecture,Bluetooth Low Energy (BLE),C# Programming,Ada programming,Programming,Haskell,Rust,Algorithm,Java,Mathematics,Machine Learning (ML),Matlab and Mathematica,Data Entry,HTML,Circuit Designs,Embedded Systems,Electronics,Microcontroller, C++ Programming,Python
API,1.0,0.14285714285714285,0.5714285714285714,0.14285714285714285,0.0,0.0,0.2857142857142857,0.0,0.0,0.0,0.14285714285714285,0.0,0.14285714285714285,0.2857142857142857,0.2857142857142857,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Backend Development,0.6666666666666666,1.0,0.6666666666666666,0.6666666666666666,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Bitcoin,0.21052631578947367,0.05263157894736842,1.0,0.05263157894736842,0.0,0.0,0.2631578947368421,0.0,0.0,0.05263157894736842,0.10526315789473684,0.10526315789473684,0.05263157894736842,0.15789473684210525,0.21052631578947367,0.05263157894736842,0.05263157894736842,0.05263157894736842,0.0,0.0,0.0,0.05263157894736842,0.05263157894736842,0.05263157894736842,0.05263157894736842,0.05263157894736842,0.05263157894736842,0.05263157894736842,0.05263157894736842,0.0,0.0,0.05263157894736842,0.0,0.0,0.0,0.0,0.05263157894736842,0.05263157894736842,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Docker,0.6666666666666666,0.6666666666666666,0.6666666666666666,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
I had a look at this by creating a pivot table in Excel for every combination of columns there are: AB AC, AD, BC, BD, CD and putting the unique entries from the first column, eg A, in the rows and the unique entries from the second, eg B, in the column and then putting column A in the values area, I find all matches and the count of all matches
This is a clunky method but I note from the Python based method that has been submitted, my answer is essentially no more or less clunky than that!
I have csv file like this:
F1,F2,F3,F4,F5,F6,F7,F8,F9,F10,label
a1,a2,a3,a4,a5,a6,a7,a8,a9,a10,L1
b1,b2,b3,b4,b5,b6,b7,b8,b9,b10,L2
I want to have the combination of columns into the file as follows:
-For a column combination with the label and write the expected results to:
file1:
F1,label
a1,L1
b1,L2
file2:
F2,label
a2,L1
b2,L2
until
file10:
F10,label
a10,L1
b10,L2
-For 2 column combinations with the label and write the expected results to:
2C_file1:
F1,F2,label
a1,a2,L1
b1,b2,L2
2C_file2:
F1,F3,label
a1,a3,L1
b1,b3,L2
until
45C_file45:
F9,F10,label
a9,a10,L1
b9,b10,L2
-For 3 columns combinations with the label and write to 120 files:
.....until.....
-For 9 columns combinations with the label and write to 10 files:
-For 10 columns combinations with the label and write to 1 files:
I have searched and I found a python code for string combination with itertool.
How could I archive above tasks with python code?
import itertools as iters
text='ABCDEFGHIJ'
C1= iters.combinations(text,1)
print list(C1)
C2= iters.combinations(text,2)
print list(C2)
.....
C9= iters.combinations(text,9)
print list(C9)
C10=iters.combinations(text,10)
print list(10)
This loop structure should be able to create the structure you would like to have. It has to be changed to write to files. Here the length of the generated sequence or the position of the output that was generated is part of the same structure that you would like to print into a file:
#!/usr/bin/env python
row1 = ["F{i}".format(i=i) for i in range(1,11)]
row1.append("label")
row2 = ["a{i}".format(i=i) for i in range(1,11)]
row2.append("L1")
row3 = ["b{i}".format(i=i) for i in range(1,11)]
row3.append("L2")
for SequenceLength in range(1, len(row1)):
for SequencePositionStart in range(len(row1)):
if row1[SequencePositionStart:SequenceLength] == []:
continue
print(','.join(row1[SequencePositionStart:SequenceLength]), row1[-1], sep=",")
print(','.join(row2[SequencePositionStart:SequenceLength]), row2[-1], sep=",")
print(','.join(row3[SequencePositionStart:SequenceLength]), row3[-1], sep=",")
I got 3 datasets which contain the flow in m3/s per location. Dataset 1 is a 5 year ARI flood, Dataset 2 is a 20 year ARI flood and Dataset 3 is a 50 year ARI flood.
Per location I found the maximum discharge (5,20 & 50)
Code:
for key in Data_5_ARI_RunID_Flow_New.keys():
m = key
y5F_RunID = Data_5_ARI_RunID_Flow_New.loc[:,m]
y20F_RunID = Data_20_ARI_RunID_Flow_New.loc[:,m]
y50F_RunID = Data_50_ARI_RunID_Flow_New.loc[:,m]
max_y5F = max(y5F_RunID)
max_y20F = max(y20F_RunID)
max_y50F = max(y50F_RunID)
Max_DataID = m, max_y5F, max_y20F, max_y50F
print (Max_DataID)
The output is like this:
('G60_18', 44.0514, 47.625, 56.1275)
('Area5_11', 1028.4065, 1191.5946, 1475.9685)
('Area5_12', 1017.8286, 1139.2628, 1424.4304)
('Area5_13', 994.5626, 1220.0084, 1501.1483)
('Area5_14', 995.9636, 1191.8066, 1517.4541)
Now I want to export this result to a csv file, but I don't know how. I used this line of code, but it didn't work:
Max_DataID.to_csv(r'C:\Users\Max_DataID.csv', sep=',', index = False)
Use this file name myexample.csv with specific path where you want to create the file.
Please check that Max_DataID is a iterable value. And as your reference the values are in form of tuple so I use list() to convert tuples into list and that will be supported values for writerow in csv.
import csv
file = open('myexample.csv', 'wb')
filewriter = csv.writer(file,delimiter =',')
for data in Max_DataID:
filewriter.writerow(list(data))
You can do the following.
df.to_csv(file_name, sep='\t')
Also, if you want to split it into chunks, like 10,000 rows, or whatever, you can do this.
import pandas as pd
for i,chunk in enumerate(pd.read_csv('C:/your_path_here/main.csv', chunksize=10000)):
chunk.to_csv('chunk{}.csv'.format(i))