How to pre-process very large data in Python - python

I have a couple of files 100 MB each. The format for those files looks like this:
0 1 2 5 8 67 9 122
1 4 5 2 5 8
0 2 1 5 6
.....
(note: the actual file does not have the alignment spaces added in; only one space separates each element. Alignment was added for aesthetic effect)
The first element in each row is its binary classification, and the rest of the row are the indices of the features whose value is 1. For instance, the third row says that row's second, first, fifth and sixth features are 1 and the rest are zeros.
I tried reading each line from each file and using sparse.coo_matrix to create a sparse matrix, like this:
for train in train_files:
    with open(train) as f:
        row = []
        col = []
        for index, line in enumerate(f):
            record = line.rstrip().split(' ')
            row = row + [index] * (len(record) - 4)
            col = col + record[4:]
    row = np.array(row)
    col = np.array(col)
    data = np.array([1] * len(row))
    mtx = sparse.coo_matrix((data, (row, col)), shape=(n_row, max_feature))
    mmwrite(train + 'trans', mtx)
but this took forever to finish. I started reading the data at night and let the computer run after I went to sleep, and when I woke up it still hadn't finished the first file!
What are the better ways to process this kind of data?

I think this would be a bit faster than your method because it does not read the file line by line. You can try this code with a small portion of one file and compare it with your code.
This code also requires knowing the number of features in advance. If you don't know the feature number, you can use the commented-out line instead.
import pandas as pd
from scipy.sparse import lil_matrix
from functools import partial

def writeMx(result, row):
    # zero-based matrix requires the feature number minus 1
    # cast to int because the NaN padding makes the dtype float
    col_ind = row.dropna().values.astype(int) - 1
    # Assign values without duplicating row index and values
    result[row.name, col_ind] = 1

def fileToMx(f):
    # number of features
    col_n = 136
    df = pd.read_csv(f, names=list(range(0, col_n + 2)), sep=' ')
    # This is the label of the binary classification
    label = df.pop(0)
    # Or get the feature number by the line below,
    # but it would not be the same across different files
    # col_n = df.max().max()
    # Number of rows
    row_n = len(label)
    # Generate the feature matrix for one file
    result = lil_matrix((row_n, col_n))
    # Save the features in the matrix, applying writeMx row-wise (axis=1)
    # DataFrame.apply() is usually faster than normal looping
    df.apply(partial(writeMx, result), axis=1)
    return result
for train in train_files:
    # result is the sparse matrix you can further save or use
    result = fileToMx(train)
    print(result.shape, result.nnz)
    # The shape of the matrix and the number of nonzero values
    # ((420, 136), 15)
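For saving each converted file, mirroring the mmwrite step from the question, a minimal sketch could look like this (assuming scipy.io.mmwrite and the same train_files list; the '_trans' suffix is only illustrative):

from scipy.io import mmwrite

for train in train_files:
    result = fileToMx(train)
    # mmwrite expects a sparse matrix; converting LIL to COO is cheap
    mmwrite(train + '_trans', result.tocoo())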

Related

Create matrix from indices and value points

I want to read a text file with the values of a matrix. Let's say you have a .txt file that looks like this:
0 0 4.0
0 1 5.2
0 2 2.1
1 0 2.1
1 1 2.9
1 2 3.1
Here, the first column gives the indices of the matrix on the x-axis and the second column gives the indices on the y-axis. The third column is the value at this position in the matrix. Where values are missing, the value is just zero.
I am well aware that data formats like the .mtx format exist, but I would like to create a scipy sparse matrix or numpy array from this txt file alone instead of adjusting it to the .mtx file format. Is there a Python function out there which does this for me that I am missing?
import numpy
with open('filename.txt', 'r') as f:
    lines = f.readlines()
data = [i.split(' ') for i in lines]
z = list(zip(*data))
row_indices = list(map(int, z[0]))
column_indices = list(map(int, z[1]))
values = list(map(float, z[2]))
m = max(row_indices) + 1
n = max(column_indices) + 1
p = max([m, n])
A = numpy.zeros((p, p))
A[row_indices, column_indices] = values
print(A)
The code above builds a square matrix. If you instead want the maximum of column 1 as the number of rows and the maximum of column 2 as the number of columns, you can remove p = max([m,n]) and replace A = numpy.zeros((p,p)) with A = numpy.zeros((m,n)).
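Since the question also mentions a scipy sparse matrix, here is a minimal sketch of the same idea with scipy.sparse.coo_matrix (assuming the same whitespace-separated 'filename.txt'):

import numpy as np
from scipy.sparse import coo_matrix

# loadtxt handles the whitespace-separated columns directly
rows, cols, vals = np.loadtxt('filename.txt', unpack=True)
rows = rows.astype(int)
cols = cols.astype(int)
# missing positions are implicitly zero in a sparse matrix
A = coo_matrix((vals, (rows, cols)), shape=(rows.max() + 1, cols.max() + 1))
print(A.toarray())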
Starting from the array (a) sorted on the first column (major) and second (minor) as in your example, you can reshape:
# a = np.loadtxt('filename')
x = len(np.unique(a[:,0]))
y = len(np.unique(a[:,1]))
a[:,2].reshape(x,y).T
Output:
array([[4. , 2.1],
       [5.2, 2.9],
       [2.1, 3.1]])

Pandas very slow query

I have the following code, which reads a CSV file and then analyzes it. One patient has more than one illness, and I need to find how many times each illness is seen across all patients. But the query given here
raw_data[(raw_data['Finding Labels'].str.contains(ctr)) & (raw_data['Patient ID'] == i)].size
is so slow that it takes more than 15 minutes. Is there a way to make the query faster?
raw_data = pd.read_csv(r'C:\Users\omer.kurular\Desktop\Data_Entry_2017.csv')
data = ["Cardiomegaly", "Emphysema", "Effusion", "No Finding", "Hernia", "Infiltration", "Mass", "Nodule", "Atelectasis", "Pneumothorax", "Pleural_Thickening", "Pneumonia", "Fibrosis", "Edema", "Consolidation"]
illnesses = pd.DataFrame({"Finding_Label": [],
                          "Count_of_Patientes_Having": [],
                          "Count_of_Times_Being_Shown_In_An_Image": []})
ids = raw_data["Patient ID"].drop_duplicates()
index = 0
for ctr in data[:1]:
    illnesses.at[index, "Finding_Label"] = ctr
    illnesses.at[index, "Count_of_Times_Being_Shown_In_An_Image"] = raw_data[raw_data["Finding Labels"].str.contains(ctr)].size / 12
    for i in ids:
        illnesses.at[index, "Count_of_Patientes_Having"] = raw_data[(raw_data['Finding Labels'].str.contains(ctr)) & (raw_data['Patient ID'] == i)].size
    index = index + 1
Part of the dataframe:
raw_data
Finding Labels     - Patient ID
IllnessA|IllnessB  - 1
Illness A          - 2
From what I read I understand that ctr stands for the name of a disease.
When you are doing this query:
raw_data[(raw_data['Finding Labels'].str.contains(ctr)) & (raw_data['Patient ID'] == i)].size
You are not only filtering the rows which have the disease, but also those which have a specific patient ID. If you have a lot of patients, you will need to run this query a lot of times. A simpler way is to skip the filter on the patient ID and take the count of all rows which have the disease.
This would be:
raw_data[raw_data['Finding Labels'].str.contains(ctr)].size
And in this case, since you want the number of rows, len is what you are looking for instead of size (size is the number of cells in the dataframe, i.e. rows times columns).
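For instance, a quick illustration of the difference with a made-up frame:

import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})
print(df.size)   # 6 -> number of cells (rows * columns)
print(len(df))   # 3 -> number of rows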
Finally, another source of error in your current code is that you were not keeping the count across patient IDs: you needed to increment illnesses.at[index, "Count_of_Patientes_Having"], not set it to a new value on each iteration.
The code would be something like (for the last few lines), assuming you want to keep the disease name and the index separate:
for index, ctr in enumerate(data[:1]):
    illnesses.at[index, "Finding_Label"] = ctr
    illnesses.at[index, "Count_of_Times_Being_Shown_In_An_Image"] = len(raw_data[raw_data["Finding Labels"].str.contains(ctr)]) / 12
    illnesses.at[index, "Count_of_Patientes_Having"] = len(raw_data[raw_data['Finding Labels'].str.contains(ctr)])
I took the liberty of using enumerate for a more pythonic way of handling indexes. I also don't really know what "Count_of_Times_Being_Shown_In_An_Image" is, but I assumed you had the same confusion between size and len there.
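If "Count_of_Patientes_Having" is meant to be the number of distinct patients with the disease rather than the number of matching rows, a hedged variant of that last line (reusing the same loop variables and column names) could be:

# count distinct patients whose labels contain the disease name
mask = raw_data['Finding Labels'].str.contains(ctr)
illnesses.at[index, "Count_of_Patientes_Having"] = raw_data.loc[mask, 'Patient ID'].nunique()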
The likely reason your code is slow is that you are growing a data frame row by row inside a loop, which can involve repeated in-memory copying. This is reminiscent of general-purpose Python rather than Pandas programming, which ideally handles data in blockwise, vectorized operations.
Consider a cross join of your data (assuming a reasonable data size) with the list of illnesses, lining up each Finding Labels value with each illness in the same row, then filtering to rows where the longer string contains the illness. Then run a couple of groupby() calls to return the count and the distinct count by patient.
# CROSS JOIN LIST WITH MAIN DATA FRAME (ALL ROWS MATCHED)
raw_data = (raw_data.assign(key=1)
                    .merge(pd.DataFrame({'ills': ills, 'key': 1}), on='key')
                    .drop(columns=['key'])
           )
# SUBSET BY ILLNESS CONTAINED IN LONGER STRING
raw_data = raw_data[raw_data.apply(lambda x: x['ills'] in x['Finding Labels'], axis=1)]
# CALCULATE GROUP BY count AND distinct count
def count_distinct(grp):
    return (grp.groupby('Patient ID').size()).size
illnesses = pd.DataFrame({'Count_of_Times_Being_Shown_In_An_Image': raw_data.groupby('ills').size(),
                          'Count_of_Patients_Having': raw_data.groupby('ills').apply(count_distinct)})
To demonstrate, consider the example below with random, seeded input data and its output.
Input data (attempting to mirror the original data)
import numpy as np
import pandas as pd

alpha = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789'
data_tools = ['sas', 'stata', 'spss', 'python', 'r', 'julia']
ills = ["Cardiomegaly", "Emphysema", "Effusion", "No Finding", "Hernia",
        "Infiltration", "Mass", "Nodule", "Atelectasis", "Pneumothorax",
        "Pleural_Thickening", "Pneumonia", "Fibrosis", "Edema", "Consolidation"]

np.random.seed(542019)
raw_data = pd.DataFrame({'Patient ID': np.random.choice(data_tools, 25),
                         'Finding Labels': np.core.defchararray.add(
                             np.core.defchararray.add(np.array([''.join(np.random.choice(list(alpha), 3)) for _ in range(25)]),
                                                      np.random.choice(ills, 25).astype('str')),
                             np.array([''.join(np.random.choice(list(alpha), 3)) for _ in range(25)]))
                        })
print(raw_data.head(10))
#   Patient ID       Finding Labels
# 0          r   xPNPneumothoraxXYm
# 1     python   ScSInfiltration9Ud
# 2      stata   tJhInfiltrationJtG
# 3          r      thLPneumoniaWdr
# 4      stata    thYAtelectasis6iW
# 5        sas      2WLPneumonia1if
# 6      julia  OPEConsolidationKq0
# 7        sas   UFFCardiomegaly7wZ
# 8      stata         9NQHerniaMl4
# 9     python         NB8HerniapWK
Output (after running above process)
print(illnesses)
#                     Count_of_Times_Being_Shown_In_An_Image  Count_of_Patients_Having
# ills
# Atelectasis                                              3                         1
# Cardiomegaly                                             2                         1
# Consolidation                                            1                         1
# Effusion                                                 1                         1
# Emphysema                                                1                         1
# Fibrosis                                                 2                         2
# Hernia                                                   4                         3
# Infiltration                                             2                         2
# Mass                                                     1                         1
# Nodule                                                   2                         2
# Pleural_Thickening                                       1                         1
# Pneumonia                                                3                         3
# Pneumothorax                                             2                         2

Slicing my data frame is returning unexpected results

I have 13 CSV files that contain billing information in an unusual format. Multiple readings are recorded every 30 minutes of the day. Five days are recorded beside each other (in columns), and then the next five days are recorded below them. To make things more complicated, the day of the week, date, and billing day are shown above the first recording of KVAR for each day.
The image below shows a small example. However, imagine that KW, KVAR, and KVA repeat 3 more times before continuing some 50 rows later.
My goal is to create a simple Python script that turns the data into a data frame with the columns: DATE, TIME, KW, KVAR, KVA, and DAY.
The problem is my script returns NaN data for the KW, KVAR, and KVA data after the first five days (which is correlated with a new instance of a for loop). What is weird to me is that when I try to print out the same ranges I get the data that I expect.
My code is below. I have included comments to help further explain things. I also have an example of sample output of my function.
def make_df(df):
    # starting values
    output = pd.DataFrame(columns=["DATE", "TIME", "KW", "KVAR", "KVA", "DAY"])
    time = df1.loc[3:50, 0]
    val_start = 3
    val_end = 51
    date_val = [0, 2]
    day_type = [1, 2]
    # There are 7 row movements that need to take place.
    for row_move in range(1, 8):
        day = [1, 2, 3]
        date_val[1] = 2
        day_type[1] = 2
        # There are 5 column movements that take place.
        # The basic idea is that I would cycle through the five days, grab their data in a temporary dataframe,
        # and then append that dataframe onto the output dataframe
        for col_move in range(1, 6):
            temp_df = pd.DataFrame(columns=["DATE", "TIME", "KW", "KVAR", "KVA", "DAY"])
            temp_df['TIME'] = time
            # These are the 3 values that stop working after the first column change
            # I get the values that I expect for the first 5 days
            temp_df['KW'] = df.iloc[val_start:val_end, day[0]]
            temp_df['KVAR'] = df.iloc[val_start:val_end, day[1]]
            temp_df['KVA'] = df.iloc[val_start:val_end, day[2]]
            # These 2 values work perfectly for the entire data set
            temp_df['DAY'] = df.iloc[day_type[0], day_type[1]]
            temp_df["DATE"] = df.iloc[date_val[0], date_val[1]]
            # trouble shooting
            print(df.iloc[val_start:val_end, day[0]])
            print(temp_df)
            output = output.append(temp_df)
            # increase values for each iteration of row loop.
            # seems to work perfectly when I print the data
            day = [x + 3 for x in day]
            date_val[1] = date_val[1] + 3
            day_type[1] = day_type[1] + 3
        # increase values for each iteration of column loop
        # seems to work perfectly when I print the data
        date_val[0] = date_val[0] + 55
        day_type[0] = day_type[0] + 55
        val_start = val_start + 55
        val_end = val_end + 55
    return output
test = make_df(df1)
Below is some sample output. It shows where the data starts to break down after the fifth day (or first instance of the column shift in the for loop). What am I doing wrong?
This could be pd.DataFrame.append requiring matched row indices for numerical values.
import pandas as pd
import numpy as np

output = pd.DataFrame(np.random.rand(5, 2), columns=['a', 'b'])  # fake data
output['c'] = list('abcde')  # add a column of non-numerical entries
tmp = pd.DataFrame(columns=['a', 'b', 'c'])
tmp['a'] = output.iloc[0:2, 2]
tmp['b'] = output.iloc[3:5, 2]  # mismatched row index generates NaN
tmp['c'] = output.iloc[0:2, 2]
output.append(tmp)
(initial response)
What does df1 look like? Does df.iloc[val_start:val_end, day[0]] have any issue past the fifth day? The code doesn't show how you read the csv files, or df1 itself.
My guess: if val_start:val_end gives invalid indices on the sixth day, or df1 happens to be malformed past the fifth day, df.iloc[val_start:val_end, day[0]] will return an empty Series object and possibly make its way into temp_df. iloc does not raise on out-of-range row slices, though similar column indices would trigger an IndexError.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(5,3), columns=['a','b','c'], index=np.arange(5)) # fake data
df.iloc[0:2, 1] # returns the subset
df.iloc[100:102, 1] # returns: Series([], Name: b, dtype: float64)
A little off topic, but I would recommend preprocessing the csv files rather than dealing with indexing in a Pandas DataFrame, as the original format is fairly complex. Slice the data by date and later use pd.melt or groupby to shape it into the format you like, or alternatively try a MultiIndex if you stick with Pandas I/O.
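As a rough illustration of the melt step only (the column names here are hypothetical, since the real files have a more complex layout):

import pandas as pd

# hypothetical wide frame: one column per day, one row per half-hour reading
wide = pd.DataFrame({'TIME': ['00:00', '00:30'],
                     'Mon': [1.2, 1.4],
                     'Tue': [1.1, 1.3]})
# melt to long format: one row per (TIME, DAY) pair
long_df = wide.melt(id_vars='TIME', var_name='DAY', value_name='KW')
print(long_df)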

reading variable number of columns with Python

I need to read a variable number of columns from my input file (the number of columns is defined by the user; there's no limit). For every column I have multiple variables to read, three in my case, also set by the user.
So the file to read is like:
2 3 5
6 7 9
3 6 8
In Fortran this is really easy to do:
DO 180 I=1,NMOD
    READ(10,*) QARR(I),RARR(I),WARR(I)
NMOD is defined by the user, as are all the values in the example. All of them are input parameters to be stored in memory. By doing this I can save all the variables I need and use them whenever I want, recalling them by changing the index I. How can I obtain the same result with Python?
Example file 'text'
2 3 5
6 7 9
3 6 8
Python code
data = []
with open('text') as file:
    columns_to_read = 1  # here you tell how many columns you want to read per line
    for line in file:
        data.append(list(map(int, line.split()[:columns_to_read])))
print(data)  # print: [[2], [6], [3]]
data will hold an array of arrays that represent your lines.
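If you want the three named column arrays from the Fortran example (QARR, RARR, WARR) instead of a list of rows, a minimal numpy sketch (assuming the same whitespace-separated file 'text') could be:

import numpy as np

# unpack=True returns one array per column, like QARR, RARR and WARR
qarr, rarr, warr = np.loadtxt('text', unpack=True, dtype=int)
print(qarr[2])  # third value of the first column, i.e. QARR(3) in Fortran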
from itertools import islice

with open('file.txt', 'rt') as f:
    # default slice from row 0 until the end with step 1
    # example: islice(f, 10, 20, 2) takes only rows 10, 12, 14, 16, 18
    dat = islice(f, 0, None, 1)
    column = None  # change column here, default to all
    # this keeps the list values as strings
    # mylist = [i.split() for i in dat]
    # this keeps the list values as ints
    mylist = [[int(j) for j in i.split()[:column]] for i in dat]
The code above constructs a 2-D list; access it with mylist[row][column].
Example: mylist[2][3] accesses row 2, column 3.
Edit: improved code efficiency following @Guillaume's and @Javier's suggestions.

Python: read timesteps from csv to arrays: post-processing model data with numpy

I am still trying to find my way around Python, but this problem exceeds my knowledge:
Topic: hydrodynamic postprocessing:
csv output of hydraulic software to array, split timesteps
Here is the data and how far i came with a working code:
Input file (see below):
First row: number of result nodes
Second row: header
Third row: timestep marker (# Output at t = ...)
Following rows: all results of this timestep (in this file: 13541 nodes, variable)
.... then the same again for the next timestep.
# Number of Nodes: 13541
#X Y Z depth wse
# Output at t = 0
5603.7598 4474.4902 37.470001 0 37.470001
5610.5 4461.6001 36.020001 0 36.020001
5617.25 4448.71 35.130001 0 35.130001
5623.9902 4435.8198 35.07 0 35.07
5630.7402 4422.9199 35.07 0 35.07
5761.5801 4402.79 35.369999 0 35.369999
COMMENT:....................13541 timesteps...........
# Output at t = 120.04446
5603.7598 4474.4902 37.470001 3.6977223 41.167724
5610.5 4461.6001 36.020001 4.1377293 40.15773
5617.25 4448.71 35.130001 3.9119012 39.041902
5623.9902 4435.8198 35.07 3.7923947 38.862394
5630.7402 4422.9199 35.07 3.998436 39.068436
5761.5801 4402.79 35.369999 3.9750571 39.345056
COMMENT:....................13541 timesteps...........
# Output at t = 240.06036
5603.7598 4474.4902 37.470001 11.131587 48.601588
5610.5 4461.6001 36.020001 12.564266 48.584266
5617.25 4448.71 35.130001 13.498463 48.628464
5623.9902 4435.8198 35.07 13.443041 48.513041
5630.7402 4422.9199 35.07 11.625824 46.695824
5761.5801 4402.79 35.369999 19.49551 54.865508
PROBLEM:
I need a loop which reads the n timesteps into arrays.
The result should be one array per timestep, in this case 27 timesteps with 13541 elements each:
timestep_1 = [all elements of this timestep: shape = (13541, 5)]
timestep_2 = []
timestep_3 = []
........
timestep_n = []
My code so far:
import numpy as np
import csv
from numpy import *
import itertools
#read file to big array
array=np.array([row for row in csv.reader(open("ascii-full.csv", "rb"), delimiter='\t')])
firstRow=array[0]
secondRow=array[1]
# find out how many nodes
strfirstRow=' '.join(map(str,firstRow))
first=strfirstRow.split()
print first[4]
nodes=first[4]
nodes=float(nodes)
#count timesteps
temp=(len(array)-3)/nodes
timesteps=int(temp)+1
#split array into timesteps:
# X Y Z h(t1) h(t2) h(tn)
ts1=array[3:nodes+3]#13541
#print ts1
ts2=array[nodes+4:nodes*2+4]
#print ts2
.......
read ts3 to last timestep to arrays with loop....
Maybe someone can help me, thanks!!!
You can use np.genfromtxt() to get a 3-D array like:
import numpy as np
gen = (a for a in open('test.txt') if not a[0] in ['#', 'C'])
a = np.genfromtxt(gen).reshape(-1, 6, 5)
where a[i] will represent the output at timestep i.
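The 6 in the reshape matches the six sample rows per timestep shown above; for the full file described in the question (13541 nodes per timestep, taken from its header line), the same idea would presumably be:

import numpy as np

# skip the '# ...' header lines and the 'COMMENT:' filler lines
gen = (line for line in open('ascii-full.csv') if line[0] not in ('#', 'C'))
a = np.genfromtxt(gen).reshape(-1, 13541, 5)
print(a.shape)  # expected: (number_of_timesteps, 13541, 5)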
My take on your problem: instead of reading the whole file into an array and processing the array, read it line by line, creating the arrays as the data is read.
I read the number of rows and columns per timestep as described in the file, then create a new array for each timestep encountered (adding it to a list) and populate it with the data that follows.
import numpy as np

timesteps = []
timestep_results = []
f = open("ascii-full.csv", "rb")
# First line is the number of rows (not counting the initial #)
rows = int(f.readline().strip()[1:].split()[-1])
counter = 0
# Second line is the number of columns
columns = len(f.readline().strip().split())
# Next lines
for line in f:
    if line.startswith("#"):
        # it's a header: add time to the timestep list, begin a new array
        timesteps.append(float(line.strip().split("=")[1]))
        timestep_results.append(np.zeros((rows, columns)))
        counter = 0
    else:
        # it's data: add it to the array in the appropriate row
        timestep_results[-1][counter] = map(float, line.strip().split())
        counter += 1
f.close()
Hope it helps!
