Searching a large csv - python

I have a csv reference data file of around 1m rows. I have a csv data file of 3m rows. I need to perform a reference data lookup for each of the 3m rows into the 1m row csv file.
For various reasons I am constrained to Python and csv. I have tried holding the 1m row table in a pandas DataFrame in memory, but the whole thing is very slow.
Can someone recommend an alternative approach?

As I mentioned above, a good solution to this type of thing would be to dump the CSV into a SQLite db and then just query it as needed :)
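For what it's worth, here is a minimal sketch of that approach (the file names, the two-column layout of the reference file, and the lookup key being the first field of each data row are all assumptions, not from the question): load the reference CSV into an indexed SQLite table once, then look up each data row against it.
import csv
import sqlite3

# Build the reference table once; PRIMARY KEY gives us an index for fast lookups.
conn = sqlite3.connect("reference.db")
conn.execute("CREATE TABLE IF NOT EXISTS ref (key TEXT PRIMARY KEY, value TEXT)")

with open("reference.csv", newline="") as f:       # hypothetical 1m-row reference file
    reader = csv.reader(f)
    next(reader)                                   # skip the header row, if any
    # assumes exactly two columns per row: key, value
    conn.executemany("INSERT OR REPLACE INTO ref (key, value) VALUES (?, ?)", reader)
conn.commit()

# Stream the 3m-row data file and look each row up against the indexed table.
with open("data.csv", newline="") as f:            # hypothetical data file
    for row in csv.reader(f):
        match = conn.execute("SELECT value FROM ref WHERE key = ?", (row[0],)).fetchone()
        if match:
            print(row, "->", match[0])
Building the table is a one-off cost; after that each lookup is an indexed query rather than a scan over 1m rows.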

Here is one idea.
import csv

# Ask the user for search criteria
search_parts = input("Enter search criteria:\n").split(",")

# Open the csv data file and go over each row, printing it if it contains every search term
with open("C:\\your_path_here\\test.csv", newline="") as f:
    for row in csv.reader(f):
        if all(x in row for x in search_parts):
            print(row)

Related

Replace a row in a pandas dataframe with values from dictionary

I am trying to populate an empty dataframe by using the csv module to iterate over a large tab-delimited file, and replacing each row in the dataframe with these values. (Before you ask, yes I have tried all the normal read_csv methods, and nothing has worked because of dtype issues and how large the file is).
I first made an empty numpy array using np.empty, using the dimensions of my data. I then converted this to a pandas DataFrame. Then, I did the following:
with open(input_file) as csvfile:
    reader = csv.DictReader(csvfile, delimiter='\t')
    row_num = 0
    for row in reader:
        for key, value in row.items():
            df.loc[row_num, key] = value
        row_num += 1
This is working great, except that my file has 900,000 columns, so it is unbelievably slow. This also feels like something that pandas could do more efficiently, but I haven't been able to figure out how. The dictionary for each row given by DictReader looks like:
{'columnName1':<value>,'columnName2':<value> ...}
Where the values are what I want to put in the dataframe in those columns for that row.
Thanks!
What you could do in this case is read your big csv data file in smaller chunks. I had the same issue with a 32 GB CSV file, so I had to build chunks. After reading them in, you can work with them chunk by chunk.
# read the large csv file with specified chunksize
df_chunk = pd.read_csv(r'../input/data.csv', chunksize=1000000)
chunksize=1000000 sets how many rows are read in at once.
Helpful website:
https://towardsdatascience.com/why-and-how-to-use-pandas-with-large-data-9594dda2ea4c
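As a rough sketch of how you would then use the chunks (the file path, the column name and the filter are placeholders, not from the question):
import pandas as pd

# read_csv with chunksize returns an iterator of DataFrames instead of one big frame
chunk_iter = pd.read_csv(r'../input/data.csv', chunksize=1000000)

results = []
for chunk in chunk_iter:
    # do the per-chunk work you need here; this filter is just a placeholder
    results.append(chunk[chunk['some_column'] > 0])

# combine the (hopefully much smaller) per-chunk results into one DataFrame
df = pd.concat(results, ignore_index=True)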

Split large CSV to Multiple CSV containing each row

I am using pandas to split a large csv into multiple csv files, each containing a single row.
I have a csv with 1 million records, and with the code below it is taking too much time.
For example: in the above case there will be 1 million csv files created.
Can anyone help me decrease the time it takes to split the csv?
for index, row in lead_data.iterrows():
    row.to_csv(row['lead_id'] + ".csv")
lead_data is the dataframe object.
Thanks
You don't need to loop through the data row by row. Filter the records by lead_id and then export the filtered data to a CSV file. That way you will be able to split the files based on the lead ID (assuming that is what you want).
Example: split all EPL games where Arsenal was at home:
import pandas as pd

data = pd.read_csv('footbal/epl-2017-GMTStandardTime.csv')
print("Selecting Arsenal")
ft = data.loc[data['HomeTeam'] == 'Arsenal']
print(ft.head())
# Export the filtered data to CSV
ft.to_csv('arsenal.csv')
print("Done!")
This is much faster than writing one record at a time.
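Applied to the original question, a sketch along the same lines (only the lead_id column name is taken from the question; the file path and output naming are assumptions) would be to group by lead_id and write each group once instead of one row at a time:
import pandas as pd

lead_data = pd.read_csv('leads.csv')   # placeholder path

# one to_csv call per lead_id instead of one per row
for lead_id, group in lead_data.groupby('lead_id'):
    group.to_csv(f"{lead_id}.csv", index=False)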

Pandas: How to read rows from CSV or Excel file?

It seems that you can look at columns in a file no problem, but there's no apparent way to look at rows. I know I can read the entire file (CSV or excel) into a crazy huge dataframe in order to select rows, but I'd rather be able to grab particular rows straight from the file and store those in a reasonably sized dataframe.
I do realize that I could just transpose/pivot the df before saving it to the aforementioned CSV/Excel file. This would be a problem for Excel because I'd run out of columns (the transposed rows) far too quickly. I'd rather use Excel than CSV.
My original, not transposed data file has 9000+ rows and 20ish cols. I'm using Excel 2003 which supports up to 256 columns.
EDIT: Figured out a solution that works for me. It's a lot simpler than I expected. I did end up using CSV instead of Excel (I found no serious difference in terms of my project). Here it is for whoever may have the same problem:
import pandas as pd

selectionList = (2, 43, 792, 4760)  # rows to select
df = pd.read_csv(your_csv_file, index_col=0).T
selection = {}
for item in selectionList:
    selection[item] = df[item]
selection = pd.DataFrame.from_dict(selection)
selection.T.to_csv(your_path)
I think you can use the skiprows and nrows arguments in pandas.read_csv to pick out individual rows to read in.
With skiprows, you can provide a (0-indexed) list of row numbers not to import, e.g. [0, 5, 6, 10]. That might end up being a huge list, though. If you provide a single integer instead, it will skip that many rows from the top of the file and start reading from there. Set nrows to the number of rows you want to read from the point where it starts.
If I've misunderstood the issue, let me know.
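For example, something along these lines (the file name and row numbers are made up):
import pandas as pd

# keep the header, skip data rows 1-100 (0-indexed, after the header), then read 50 rows
df = pd.read_csv('your_file.csv', skiprows=range(1, 101), nrows=50)

# or, with a single integer, skip the first 10 lines of the file entirely (header included)
df2 = pd.read_csv('your_file.csv', skiprows=10, nrows=50, header=None)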

Reading from a specific row/column from an Excel/CSV file

I am a beginner at Python and I'm looking to take 3 specific columns starting at a certain row from a .csv spreadsheet and then import each into python.
For example
I would need to take 1000 rows worth of data from column F starting at
row 12.
I've looked at options using csv and pandas but I can't figure out how
to have them start importing at a certain row/column.
Any help would be greatly appreciated.
If the spreadsheet is not huge, the easiest approach is to load the entire CSV file into Python using the csv module and then extract the required rows and columns. For example:
import csv

with open('Book1.csv', newline='') as f:
    rows = list(csv.reader(f))
data = [row[5] for row in rows[11:11 + 1000]]
will do the trick. Remember that Python starts numbering from 0, so row[5] is column F from your spreadsheet and rows[11] is row 12.
CSV files being plain text files, there is no way to jump straight to a particular line. You will have to read line by line and count. Have a look at the csv module in Python, whose documentation explains how to (easily) read lines.
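A minimal sketch of that counting approach (the file name is assumed; column F is index 5 and row 12 is index 11, as in the answer above), using itertools.islice so you never hold more than the 1000 rows you want:
import csv
from itertools import islice

with open('Book1.csv', newline='') as f:
    reader = csv.reader(f)
    # skip the first 11 rows, then take 1000 rows and keep column F (index 5)
    data = [row[5] for row in islice(reader, 11, 11 + 1000)]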

handling a huge file with python and pytables

simple problem, but maybe tricky answer:
The problem is how to handle a huge .txt file with pytables.
I have a big .txt file, with MILLIONS of lines, short lines, for example:
line 1 23458739
line 2 47395736
...........
...........
The content of this .txt must be saved into a pytable; OK, that's easy. Nothing else needs to be done with the info in the txt file, just copy it into pytables, and now we have a pytable with, for example, 10 columns and millions of rows.
The problem comes up when, from the content of the txt file, 10 columns x millions of lines are generated directly in the pytable BUT, depending on the data on each line of the .txt file, new columns must be created in the pytable. So how can this be handled efficiently?
Solution 1: first copy the whole text file, line by line, into the pytable (millions of rows), then iterate over each row of the pytable (millions again) and, depending on the values, generate the new columns needed.
Solution 2: read the .txt file line by line, do whatever is needed, calculate the new values, and then send all the info to a pytable.
Solution 3: ... any other more efficient and faster solution?
I think the basic problem here is one of conceptual model. PyTables' Tables only handle regular (or structured) data. However, the data that you have is irregular or unstructured, in that the structure is determined as you read the data. Said another way, PyTables needs the column description to be known completely by the time that create_table() is called. There is no way around this.
Since in your problem statement any line may add a new column, you have no choice but to do this in two full passes through the data: (1) read through the data and determine the columns and (2) write the data to the table. In pseudocode:
import tables as tb

cols = {}

# pass 1: discover the columns
d = open('data.txt')
for line in d:
    for col in line:            # pseudocode: iterate over the column names found on this line
        if col not in cols:
            cols[col] = ...     # pseudocode: an appropriate tables Col descriptor for this column

# pass 2: write the table
d.seek(0)
f = tb.open_file(...)
t = f.create_table(..., description=cols)
for line in d:
    row = line_to_row(line)
    t.append(row)
d.close()
f.close()
Obviously, if you knew the table structure ahead of time you could skip the first loop and this would be much faster.
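For instance, a minimal single-pass version with a known, fixed structure might look like this (the column names, types and file names are made up for illustration):
import tables as tb

# column description known ahead of time (names and types are hypothetical)
description = {
    'line_no': tb.Int64Col(),
    'value': tb.Int64Col(),
}

with tb.open_file('data.h5', mode='w') as f:
    t = f.create_table(f.root, 'data', description)
    row = t.row
    with open('data.txt') as d:
        for i, line in enumerate(d):
            row['line_no'] = i
            row['value'] = int(line.split()[-1])  # assumes the number is the last field on the line
            row.append()
    t.flush()
This only needs one pass over the text file, which is why knowing the structure up front helps so much.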
