How to efficiently 'query' multiple tsv files? - python

I've got about 40 TSV files, each ranging from 250 MB to 3 GB. I'm looking to pull data from the TSVs where rows contain certain values.
My current approach is far from efficient:
nums_to_look = ['23462346', '35641264', ... , '35169331'] # being about 40k values I'm interested in
all_tsv_files = glob.glob(PATH_TO_FILES + '*.tsv')
all_dfs = []
for file in all_tsv_files:
    df = pd.read_csv(file, sep='\t')
    # Extract rows which match values in nums_to_look
    df = df[df['col_of_interest'].isin(nums_to_look)].reset_index(drop=True)
    all_dfs.append(df)
Surely there's a much more efficient way to do this without having to read every single file fully into memory and scan every row?
Any thoughts / insights would be much appreciated.
Thanks!
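One common pattern (a sketch, not a drop-in answer; `nums_to_look`, `col_of_interest`, and the `data/` path reuse or assume names from the question): stream each file in chunks, filter every chunk against a set, and keep only the matching rows in memory:

```python
import glob

import pandas as pd

# Hypothetical keys from the question; a set gives O(1) membership tests
nums_to_look = {'23462346', '35641264', '35169331'}

matches = []
for path in glob.glob('data/*.tsv'):
    # Stream each file so only ~1M rows are in memory at a time, and read
    # the filter column as strings so the keys compare exactly.
    for chunk in pd.read_csv(path, sep='\t', dtype={'col_of_interest': str},
                             chunksize=1_000_000):
        hit = chunk[chunk['col_of_interest'].isin(nums_to_look)]
        if not hit.empty:
            matches.append(hit)

result = pd.concat(matches, ignore_index=True) if matches else pd.DataFrame()
```

This keeps peak memory bounded by the chunk size rather than the largest file; if the files are queried repeatedly, a one-time conversion to a format with predicate pushdown (e.g. Parquet or SQLite) would avoid rescanning altogether.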

Related

Use different filters on different pandas data frames on each loop

I am learning Python in order to automate some tasks at my work. I need to read many big TSV files and filter each one of them on a list of keys that may differ from one TSV to another, then return the result in an Excel report using a certain template.

To do that, I load all the data common to the files at once: the list of TSV files to process, some QC values, and the filter to be used for each TSV. Then, in a loop, I read the files to process, loading one TSV per loop round into a pandas data frame, filtering it, and filling the Excel report.

The problem is that the filter does not seem to work: I am getting Excel files of the same length (number of rows) each time! I suspect that the data frame does not get reinitialized on each round, but trying to del the df and release the memory does not work either. How can I solve this? Is there a better data structure for my problem? Below is part of the code.
Thanks
# This is how I load the filter for each file in each round of the loop (as an array)
target = load_workbook(target_input)
target_sheet = target.active
targetProbes = []
for k in range(2, target_sheet.max_row + 1):
    targetProbes.append(target_sheet.cell(k, 1).value)

# This is how I load each TSV file in each round of the loop (as a df)
tsv_reader = pd.read_csv(calls, delimiter='\t', comment='#', low_memory=False)

# This is how I filter the df based on the filter
tsv_reader = tsv_reader[tsv_reader['probeset_id'].isin(targetProbes)]
tsv_reader.reset_index(drop=True, inplace=True)

# This is how I try to del the df and clear the current filter so I can re-use
# them for another TSV file with different values in the next round of the loop
del tsv_reader
targetProbes.clear()

memory issue when merging two data frames

I am stuck at the second-to-last statement, clueless. The error is: numpy.core._exceptions.MemoryError: Unable to allocate 58.1 GiB for an array with shape (7791676634,) and data type int64
My thinking was that merging a data frame of ~12 million records with another data frame of 3-4 more columns should not be a big deal.
Please help me out. Totally stuck here. Thanks.
Select_Emp_df has around 900k records, and Big_df has around 12 million records and 9 columns. I just need to merge the two DFs on a key column, like a VLOOKUP in Excel.
import pandas as pd
Emp_df = pd.read_csv('New_Employee_df.csv', low_memory=False)
# Append data into one data frame from three csv files of 3 years' transactions
df2019 = pd.read_csv('U21_02767G - Customer Trade Info2019.csv', low_memory=False)
df2021 = pd.read_csv('U21_02767G - Customer Trade Info2021(TillSep).csv', low_memory=False)
df2020 = pd.read_csv('Newdf2020.csv', low_memory=False)
Big_df = pd.concat([df2019, df2020, df2021], ignore_index=True)
Select_Emp_df = Emp_df[['CUSTKEY','GCIF_GENDER_DSC','SEX']]
Big_df = pd.merge(Big_df, Select_Emp_df, on='CUSTKEY')
print(Big_df.info())
Just before Big_df = pd.merge(Big_df, Select_Emp_df, on='CUSTKEY')
try to delete previous dataframes. Like this.
del df2019
del df2020
del df2021
This should save some memory.
Also try:
Select_Emp_df = Emp_df[['CUSTKEY','GCIF_GENDER_DSC','SEX']].drop_duplicates(subset=['CUSTKEY'])
When I was younger, I used machines where the available RAM per process was 32k to 640k, and I used to process huge datasets on them (err... several MB, but still much larger than memory). The key was to keep in memory only what was required.
Here you concat 3 large dataframes and later merge the result with another one. If you have memory issues, just reverse the concat and the merge: merge each individual file with Emp_df and immediately write the merged file to disk, throwing everything out of memory between each step. If you use csv files, you can even build the concatenated csv file directly by appending the 2nd and 3rd merged files to the first one (use mode='a', header=False in the to_csv method).
Using suggestions provided by community here and a bit of my research I edited my codes and here is what worked for me -
Select_Emp_df['LSTTRDDT'] = pd.to_datetime(Select_Emp_df['LSTTRDDT'], errors='coerce')
Select_Emp_df = Select_Emp_df.sort_values(by='LSTTRDDT', ascending=True)
Select_Emp_df = Select_Emp_df.drop_duplicates(subset='CUSTKEY', keep='last')
I just sorted the values by last transaction date and dropped duplicates (on CUSTKEY) in the Select_Emp_df data frame.

How to split an inhomogeneous dataset into several files according to some specific keywords in the dataset?

I have a .dat file that shows the results of a lab experiment at different conditions (you can find the dataset here). The data file is inhomogeneous, meaning that the number of rows and columns differs from one place to another, as follows:
The dataset also has Table 2, Table 3, ..., Table 7.
So, what I need to do is split this dataset into multiple files, one per table of records. I have tried the following code:
import pandas as pd
import os
f = open('BCZT015Fe_21-120.dat', 'r')
databaseRaw = f.read()
database = databaseRaw.split('Table ')
f.close()
for i in range(0, len(database) - 1):
    database[i] = 'Table ' + ''.join(database[i])
for i in range(0, len(database)):
    f = open("output" + str(i) + ".csv", "w")
    f.writelines(database[i])
    f.close()
os.remove("output0.csv")
os.remove("output1.csv")
os.remove("output2.csv")
My idea here is to save the data in a list, iterate over the whole database, and whenever the word "Table" is detected, split the data before and after it. The problem with this approach is that the resulting files have the data stored in just one column (which is understandable, since I am saving the dataset in just one list). So, how can I split such a dataset into multiple files and keep the data from the original file as it is in the separated files?
Hope someone can help me out with this problem. Thank you in advance!

i want to create matrix from csv file

I want to create a matrix from CSV file.
Here's what I've tried:
df = pd.read_csv('csv-path', usecols=[0,1], names=['A', 'B'])
pd.pivot_table(df,columns='A', values='B')
output : [9197337 rows x 2 columns].
I want to take fewer rows like I want to make a matrix of first 100 entries or 1000. How can I do that?
Since the csv module only deals in complete files, it would be easiest to extract the lines of interest before you use it. You could do this before running your program with the Unix head utility. Here's one way that should work in Python:
with open("csv-path") as inf, open("mod_csv_path", "w") as outf:
    for i in range(1000):
        outf.write(inf.readline())
Obviously you'd then read "mod_csv_path" rather than "csv-path" as the input file.
Pandas seems to be the right approach? Can you provide a sample of your CSV file?
Also, with pandas, you can limit the size of your dataframe:
limited_df = df.head(num_elements)

Accessing data in chunks using Python Pandas

I have a large text file, separated by semi-column. I am trying to retrieve the values of a column (e.g. the second column) and work on it iteratively using numpy. An example of data contained in a text file is given below:
10862;2;1;1;0;0;0;3571;0;
10868;2;1;1;1;0;0;3571;0;
10875;2;1;1;1;0;0;3571;0;
10883;2;1;1;1;0;0;3571;0;
...
11565;2;1;1;1;0;0;3571;0;
11572;2;1;1;1;0;0;3571;0;
11579;2;1;1;1;0;0;3571;0;
11598;2;1;1;1;0;0;3571;0;
11606;2;1;1;
Please note that the last line may not contain the same number of values as the previous ones.
I am trying to use pandas.read_csv to read this large file by chunks. For the purpose of the example, let's assume that the chunk size is 40.
I have tried so far 2 different approaches:
1)
Set nrows, and iteratively increase the skiprows so as to read the entire file by chunk.
nrows_set = 40
n_it = 0
while True:
    df = pd.read_csv(filename, nrows=nrows_set, sep=';', skiprows=n_it * nrows_set)
    vect2 = df[1]  # trying to access the values of the second column -- works
    n_it = n_it + 1
Issue when accessing the end of the file: pandas generates an error when one tries to read more rows than the file contains.
For example, if the file contains 20 lines and nrows is set to 40, the file cannot be read. My first approach hence generated a bug when I tried to read the last 40 lines of my file, when fewer than 40 lines were remaining.
I do not know how to check for the end of the file before trying to read from it, and I do not want to load the entire file just to obtain the total row count, since the file is large. Hence, I tried a second approach.
2)
Set chunksize. This works well; however, I have an issue when I then try to access the data in the chunk:
reader = pd.read_csv(filename, chunksize=40, sep=';')
for chunk in reader:
    print(chunk)  # displays data -- the data looks correct
    chunk[1]      # trying to access the values of the second column -- generates an error
What is the data type of chunk, and how can I convert it so this operation works?
Alternatively, how can I retrieve the number of lines in a file without loading the entire file into memory (solution 1)?
Thank you for your help!
Gaelle
chunk is a data frame,
so you can access it using indexers (accessors) such as .loc / .iloc / .at (note that .ix is deprecated and has been removed from modern pandas):
chunk.loc[:, 'col_name']
chunk.iloc[:, 1]  # second column
