xlrd data extraction in Python

I am working on data extraction using xlrd and I have extracted 8 columns of inputs for my project. Each column of data has around 100 rows. My code is as follows:
wb = xlrd.open_workbook('/Users/Documents/Sample_data/AI_sample.xlsx')
sh = wb.sheet_by_name('Sample')
x1 = sh.col_values(0)[1:]
x2 = sh.col_values(1)[1:]
x3 = sh.col_values(2)[1:]
x4 = sh.col_values(3)[1:]
x5 = sh.col_values(4)[1:]
x6 = sh.col_values(5)[1:]
x7 = sh.col_values(6)[1:]
x8 = sh.col_values(7)[1:]
Now I want to create an array of inputs that holds each row across the 8 columns.
For example, if this is my 8 columns of data:
x1 x2 x3 x4 x5 x6 x7 x8
1 2 3 4 5 6 7 8
7 8 6 5 2 4 8 8
9 5 6 4 5 1 7 5
7 5 6 3 1 4 5 6
I want something like [1, 2, 3, 4, 5, 6, 7, 8] (the first row across x1 through x8), and the same for all 100+ rows.
I could extract each row individually, but doing that for 100+ rows is impractical. How do I do that? I also understand it could be done using np.array, but I do not know how.

You can also try openpyxl, which works similarly to xlrd:
from openpyxl import load_workbook

book = load_workbook(filename=file_name)
sheet = book['sheet name']
for row in sheet.rows:
    col_0 = row[0].value  # value in the first column of this row
    col_1 = row[1].value  # value in the second column of this row
I prefer openpyxl over xlrd.
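For this question's shape of data, the same loop can collect all eight values of each row at once (a minimal sketch; file_name and the sheet name are placeholders as above):
rows = []
for row in sheet.iter_rows(min_row=2):  # min_row=2 skips the header row
    rows.append([cell.value for cell in row[:8]])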

I found this approach very useful:
import numpy as np

# Stack the eight columns as rows of a 2-D array, then transpose so
# each row of X corresponds to one spreadsheet row.
X = np.array([x1, x2, x3, x4, x5, x6, x7, x8]).T
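An alternative without numpy is zip, which pairs the i-th element of every column (a minimal sketch using the variables from the question):
# Each element of rows is one spreadsheet row across the eight columns.
rows = list(zip(x1, x2, x3, x4, x5, x6, x7, x8))
print(rows[0])  # (1.0, 2.0, ..., 8.0) for the sample data; xlrd reads numbers as floats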

Related

VTK formatted output with each point on a new line

I aim to generate a .vtk-format file with N POINT and M POLYGON entries.
The output code is listed below, where polymesh is the vtk.vtkPolyData() containing the POINT and POLYGON data:
writer = vtk.vtkPolyDataWriter()
writer.SetFileTypeToASCII()    # write the legacy ASCII .vtk format
writer.SetInputData(polymesh)  # the vtkPolyData holding points and polygons
writer.SetFileName(filename)
writer.Write()
Here is my concern: the output currently comes out as
...
POINTS N doubles
X0 Y0 Z0 X1 Y1 Z1 X2 Y2 Z2
X3 Y3 Z3 ...
...
POLYGONS
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16
Is there any way to make each point appear on its own line, and each polygon on its own line as well? For example, the expected output would be:
...
POINTS N doubles
X0 Y0 Z0
X1 Y1 Z1
X2 Y2 Z2
X3 Y3 Z3
...
POLYGONS
1 2 3 4 5 6 7 8
9 10 11 12 13 14 15 16
There is no such option on the writer class.
I see only two solutions:
create your own writer
post-process your file (a sketch of this option follows)
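A minimal post-processing sketch for the POINTS section, assuming the legacy ASCII layout shown above (the function name and paths are illustrative):
def reflow_points(in_path, out_path):
    with open(in_path) as f:
        lines = f.read().splitlines()
    out = []
    i = 0
    while i < len(lines):
        line = lines[i]
        out.append(line)
        i += 1
        if line.startswith('POINTS'):
            n = int(line.split()[1])  # number of points declared in the header
            values = []
            # Gather coordinate tokens until all 3*n values are collected.
            while i < len(lines) and len(values) < 3 * n:
                values.extend(lines[i].split())
                i += 1
            # Re-emit one "X Y Z" triple per line.
            for j in range(0, len(values), 3):
                out.append(' '.join(values[j:j + 3]))
    with open(out_path, 'w') as f:
        f.write('\n'.join(out) + '\n')
The POLYGONS block could be reflowed the same way, using the counts on its header line.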

How do I encode string data in a column so that I can apply machine learning techniques for classification, for example k-means?

I have string variables (Range[VarName]) in a column, each with a respective ID (Range[kksId]). I need to create an algorithm that will classify new variables to an existing ID or, if that is not possible, put them separately in an N/A class.
Generally, since your variable "Range[kksId]" is your target class, you map each of these strings to a unique integer. Here's an example of how that could be achieved in Python:
import pandas as pd

def _categoricalToNumeric(dataset):
    # Map each distinct categorical value to the next free integer id.
    categoric_id_mapping = {}
    curr_id_to_assign = 0
    for row in dataset.index:
        categorical_value = dataset.loc[row]
        if categorical_value in categoric_id_mapping:
            dataset.loc[row] = categoric_id_mapping[categorical_value]
        else:
            categoric_id_mapping[categorical_value] = curr_id_to_assign
            dataset.loc[row] = curr_id_to_assign
            curr_id_to_assign += 1
    return dataset

df = pd.read_excel('DataModel.xlsx', index_col=0)
df['Range[kksId]'] = _categoricalToNumeric(df['Range[kksId]'])
Then, as for the string feature: in a simple classifier, each character is generally mapped to its own variable. Example:
R_r_DegPit1_In_St
R_r_DegPit1_In
becomes:
x0 x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 x14 x15 x16
R _ r _ D e g P i t 1 _ I n _ S t
R _ r _ D e g P i t 1 _ I n \0 \0 \0
Since you will have as many variables as the longest string in your dataset, strings that do not occupy all variables should have the remaining ones filled with a value indicating the empty character. You should also convert the character values to numbers; however, it is important not to reset the numeric counting for each column. The result could be something like this:
x0 x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 x14 x15 x16
3 1 4 1 5 10 11 6 12 13 2 1 7 14 1 8 9
3 1 4 1 5 10 11 6 12 13 2 1 7 14 0 0 0
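A minimal sketch of this encoding, assuming one shared character-to-integer mapping across all positions (the function name and padding value are illustrative, and the assigned ids will differ from the table above):
def encode_strings(strings, pad_value=0):
    max_len = max(len(s) for s in strings)
    char_ids = {}  # one shared character -> integer mapping
    next_id = 1    # 0 is reserved for the empty/padding character
    encoded = []
    for s in strings:
        row = []
        for ch in s:
            if ch not in char_ids:
                char_ids[ch] = next_id
                next_id += 1
            row.append(char_ids[ch])
        row.extend([pad_value] * (max_len - len(s)))  # pad short strings
        encoded.append(row)
    return encoded

print(encode_strings(["R_r_DegPit1_In_St", "R_r_DegPit1_In"]))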
Keep in mind that more advanced ML/DL techniques handle strings in different ways.

Create a new dataframe with k copies of each row appended to itself

Suppose I have a dataframe with n rows:
Index data1 data2 data3
0 x0 x0 x0
1 x1 x1 x1
2 x2 x2 x2
...
n xn xn xn
How do I create a new dataframe (using pandas) with k copies of each row appended to itself:
Index data1 data2 data3
0 x0 x0 x0
1 x0 x0 x0
...
k-1 x0 x0 x0
k x1 x1 x1
k+1 x1 x1 x1
...
2k-1 x1 x1 x1
2k x2 x2 x2
...
First concat, then sort
The method I'd use is to create a list of duplicate dataframes, concat them together, and then sort_index. Because concat preserves each copy's original index, sorting brings the k copies of every row together:
count = 5
new_df = pd.concat([df]*count).sort_index()
Using numpy.repeat and .iloc (here, k=3):
import numpy as np

df.iloc[np.repeat(np.arange(len(df)), 3)]
Out[256]:
Index data1 data2 data3
0 0 x0 x0 x0
0 0 x0 x0 x0
0 0 x0 x0 x0
1 1 x1 x1 x1
1 1 x1 x1 x1
1 1 x1 x1 x1
2 2 x2 x2 x2
2 2 x2 x2 x2
2 2 x2 x2 x2
Option 1
Use repeat + reindex + reset_index:
df
data1 data2 data3
0 x0 x0 x0
1 x1 x1 x1
2 x2 x2 x2
df.reindex(df.index.repeat(5)).reset_index(drop=1)
data1 data2 data3
0 x0 x0 x0
1 x0 x0 x0
2 x0 x0 x0
3 x0 x0 x0
4 x0 x0 x0
5 x1 x1 x1
6 x1 x1 x1
7 x1 x1 x1
8 x1 x1 x1
9 x1 x1 x1
10 x2 x2 x2
11 x2 x2 x2
12 x2 x2 x2
13 x2 x2 x2
14 x2 x2 x2
Option 2
Similar solution with repeat + pd.DataFrame:
pd.DataFrame(np.repeat(df.values, 5, axis=0), columns=df.columns)
data1 data2 data3
0 x0 x0 x0
1 x0 x0 x0
2 x0 x0 x0
3 x0 x0 x0
4 x0 x0 x0
5 x1 x1 x1
6 x1 x1 x1
7 x1 x1 x1
8 x1 x1 x1
9 x1 x1 x1
10 x2 x2 x2
11 x2 x2 x2
12 x2 x2 x2
13 x2 x2 x2
14 x2 x2 x2
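One caveat with Option 2: df.values materializes the frame as a single NumPy array, so mixed column dtypes are upcast to a common type (often object), whereas the reindex-based Option 1 keeps each column's dtype intact.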
Comparisons
%timeit pd.concat([df] * 100000).sort_index().reset_index(drop=1)
1 loop, best of 3: 14.6 s per loop
%timeit df.iloc[np.repeat(np.arange(len(df)), 100000)].reset_index(drop=1)
10 loops, best of 3: 22.6 ms per loop
%timeit df.reindex(df.index.repeat(100000)).reset_index(drop=1)
10 loops, best of 3: 19.9 ms per loop
%timeit pd.DataFrame(np.repeat(df.values, 100000, axis=0), columns=df.columns)
100 loops, best of 3: 17.1 ms per loop
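The concat version is slow because both the concatenation of 100000 frames and the subsequent sort over the full result carry overhead that the repeat-based options avoid; those emit the rows in the right order in a single pass.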

Rearrange data in csv with Python

I have a .csv file with the following format:
A B C D E F
X1 X2 X3 X4 X5 X6
Y1 Y2 Y3 Y4 Y5 Y6
Z1 Z2 Z3 Z4 Z5 Z6
What I want:
A X1
B X2
C X3
D X4
E X5
F X6
A Y1
B Y2
C Y3
D Y4
E Y5
F Y6
A Z1
B Z2
C Z3
D Z4
E Z5
F Z6
I am unable to wrap my mind around the built-in transpose functions in order to achieve the final result. Any help would be appreciated.
You can simply melt your dataframe using pandas:
import pandas as pd
df = pd.read_csv(csv_filename)
pd.melt(df)
variable value
0 A X1
1 A Y1
2 A Z1
3 B X2
4 B Y2
5 B Z2
6 C X3
7 C Y3
8 C Z3
9 D X4
10 D Y4
11 D Z4
12 E X5
13 E Y5
14 E Z5
15 F X6
16 F Y6
17 F Z6
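Note that melt groups the output by column (all A pairs first), rather than the per-row A..F order shown in the question. If that exact order matters, stack() walks the frame row by row (a minimal sketch):
import pandas as pd

df = pd.read_csv(csv_filename)

# stack() yields (row, column) -> value in row-major order:
# (0, A) -> X1, (0, B) -> X2, ..., (1, A) -> Y1, ...
for (_, header), value in df.stack().items():
    print(header, value)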
A pure Python solution would be as follows:
file_out_delimiter = ','  # Use '\t' for tab delimited.

with open(filename, 'r') as f, open(filename_out, 'w') as f_out:
    headers = f.readline().split()  # first line holds the column headers
    for row in f:
        # Pair each header with the corresponding value in this row.
        for pair in zip(headers, row.split()):
            f_out.write(file_out_delimiter.join(pair) + '\n')
resulting in the following file contents:
A,X1
B,X2
C,X3
D,X4
E,X5
F,X6
A,Y1
B,Y2
C,Y3
D,Y4
E,Y5
F,Y6
A,Z1
B,Z2
C,Z3
D,Z4
E,Z5
F,Z6

Rows that appear in two separate txt files, remove from one txt file

Pretty new to Python. My issue is that I have one txt file ('A.txt') that has a bunch of columns in it, and a second txt file ('B.txt') that has different data. However, some of the data in B also shows up in A. Example:
A.txt:
name1 x1 y1
name2 x2 y2
name3 x3 y3
name4 x4 y4
name5 x5 y5
name6 x6 y6
...
B.txt
namea xa ya
name2 x2 y2
name3 x3 y3
nameb xb yb
namec xc yc
...
I want everything in B.txt that shows up in A.txt to be removed from A.txt
I realize this has been asked before, and I have tried the advice given to people asking similar questions, but it doesn't work for me.
So far I have:
tot = 0
with open('B.txt', 'r') as f1:
    for a in f1:
        WR = a.strip().split()
        with open('A.txt', 'r+') as f2:
            for b in f2:
                l = b.strip().split()
                if WR not in l:
                    print l
                    tot += 1
                #I've done it the following way and it also doesn't give the
                #output I need
                #if WR == l: #find duplicates
                #    continue
                #else:
                #    print l
print tot
When I run this I get back what I think is the answer (file A has 2060 lines, file B has 154), but repeated 154 times.
An example of what I mean:
A.txt:
name1 x1 y1
name4 x4 y4
name5 x5 y5
name6 x6 y6
...
name1 x1 y1
name4 x4 y4
name5 x5 y5
name6 x6 y6
...
name1 x1 y1
name4 x4 y4
name5 x5 y5
name6 x6 y6
...
name1 x1 y1
name4 x4 y4
name5 x5 y5
name6 x6 y6
...
I only want it to look like:
A.txt:
name1 x1 y1
name4 x4 y4
name5 x5 y5
name6 x6 y6
...
Like I've said, I've already looked at other similar questions and tried what they did, and it's giving me this repeating answer. Any suggestions would be greatly appreciated!
I just copied your A and B files, so check and let me know if it's correct:
tot = 0
f1 = open('B.txt', 'r')
f2 = open('A.txt', 'r')

lista = []
for line in f1:
    # strip() returns a new string, so keep its result; otherwise
    # trailing newlines break the comparison below
    lista.append(line.strip())

listb = []
for line in f2:
    listb.append(line.strip())

for b in listb:
    if b in lista:
        continue
    else:
        print(b)
        tot += 1
This code prints the lines, but you can write them to a file instead if you want; a set-based variant that does so is sketched below.
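A variant of the same idea that uses a set for fast membership tests and writes the result to a new file (the output name 'A_filtered.txt' is illustrative):
# Collect B's lines once; a set makes each membership test O(1).
with open('B.txt') as f1:
    b_lines = {line.strip() for line in f1}

# Keep only the lines of A that do not also appear in B.
with open('A.txt') as f2, open('A_filtered.txt', 'w') as out:
    for line in f2:
        if line.strip() not in b_lines:
            out.write(line)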
