I have a very large CSV file with millions of rows and a list of the row numbers that I need, like:
rownumberList = [1,2,5,6,8,9,20,22]
I know there is a parameter called skiprows that lets you skip several rows when reading a CSV file, like this:
df = pd.read_csv('myfile.csv', skiprows=skiplist)
# skiplist would contain every row number except those in rownumberList
However, since the CSV file is very large, directly selecting the rows I need could be more efficient. Is there a way to select rows while reading with read_csv? I don't want to select rows from the DataFrame afterwards, since I am trying to minimize the time spent reading the file. Thanks.
There is a parameter called nrows : int, default None
Number of rows of file to read. Useful for reading pieces of large files (Docs)
pd.read_csv(file_name,nrows=int)
If you need a part in the middle, use both skiprows and nrows in read_csv: skiprows gives the number of rows to skip at the beginning, and nrows gives the number of rows to read after skipping.
Example:
pd.read_csv('../input/sample_submission.csv', skiprows=5, nrows=10)
This skips the first 5 lines of the file, uses the next line as the header, and then reads the following 10 rows.
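If the original header should be kept while skipping data rows, one option is to pass a range that starts at 1 to skiprows instead of an integer. A minimal sketch, assuming an in-memory CSV with hypothetical columns id and value standing in for the real file:

```python
import io
import pandas as pd

# Hypothetical sample: a header line followed by 30 data rows.
csv_text = "id,value\n" + "\n".join(f"{i},{i * i}" for i in range(1, 31)) + "\n"

# Skip file lines 1..5 (the first five data rows) but keep line 0, the header,
# then read the next 10 data rows.
df = pd.read_csv(io.StringIO(csv_text), skiprows=range(1, 6), nrows=10)
print(df["id"].tolist())
```

Here the column names survive because line 0 is never skipped.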
Edit based on comment:
Since there is a list, this might help:
li = [1,2,3,5,9]
r = [i for i in range(1, max(li)) if i not in li]  # start at 1 so the header line (0) is kept
df = pd.read_csv('../input/sample_submission.csv',skiprows=r,nrows=len(li))
# This skips the rows you don't want and stops reading after the last row in the list.
import pandas as pd
rownumberList = [1,2,5,6,8,9,20,22]
df = pd.read_csv('myfile.csv',skiprows=lambda x: x not in rownumberList)
In pandas 0.25.1, read_csv accepts a callable for skiprows.
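One detail worth checking with the callable form is that line 0 (the header) is also passed to the callable, so it is skipped unless it is explicitly kept. A small sketch under the assumption that the file has a header line, using a set for faster membership tests and an in-memory CSV in place of 'myfile.csv':

```python
import io
import pandas as pd

# Hypothetical sample: header plus 30 data rows, where data row k holds k * 10.
csv_text = "value\n" + "\n".join(str(i * 10) for i in range(1, 31)) + "\n"

# With a header on line 0, data row k sits on file line k.
wanted = {1, 2, 5, 6, 8, 9, 20, 22}

df = pd.read_csv(
    io.StringIO(csv_text),
    # Keep line 0 (the header) and the wanted lines; skip everything else.
    skiprows=lambda line_no: line_no != 0 and line_no not in wanted,
)
print(df["value"].tolist())
```

The callable is evaluated for every line number, so the whole file is still scanned; only the parsing of unwanted rows is avoided.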
I am not sure about read_csv() from Pandas (there is though a way to use an iterator for reading a large file in chunks), but you can read the file line by line (lazy-loading, not reading the whole file in memory) with csv.reader (or csv.DictReader), leaving only the desired rows with the help of enumerate():
import csv
import pandas as pd
DESIRED_ROWS = {1, 17, 28}
with open("input.csv") as input_file:
    reader = csv.reader(input_file)
    desired_rows = [row for row_number, row in enumerate(reader)
                    if row_number in DESIRED_ROWS]
df = pd.DataFrame(desired_rows)
(This assumes you would like to pick random/discontinuous rows and not a "continuous chunk" from somewhere in the middle. In that case @James's idea of having a "start" and a "stop" would generally work better.)
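The enumerate() approach can also stop as soon as the largest wanted row has been seen, which avoids scanning the rest of a large file. A minimal sketch; pick_rows is a hypothetical helper name and the in-memory sample stands in for the real file:

```python
import csv
import io

def pick_rows(lines, wanted):
    """Return the rows whose 0-based position is in `wanted`, stopping early."""
    wanted = set(wanted)
    if not wanted:
        return []
    last = max(wanted)
    picked = []
    for row_number, row in enumerate(csv.reader(lines)):
        if row_number in wanted:
            picked.append(row)
        if row_number >= last:
            break  # no later row can be in `wanted`
    return picked

# Hypothetical sample: 100 rows of "i,i*i".
sample = io.StringIO("".join(f"{i},{i * i}\n" for i in range(100)))
rows = pick_rows(sample, {1, 17, 28})
print(rows)
```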
import pandas as pd
df = pd.read_csv('Data.csv')
df.iloc[3:6]
Returns rows 3 through 5 and all columns.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.iloc.html
From the documentation you can see that skiprows can take an integer or a list of row numbers to skip.
So basically you can tell it to skip all rows except those you want. For this you first need to know the number of lines in the file (best if you know it beforehand), which you can get by opening the file and counting:
with open('myfile.csv') as f:
    row_count = sum(1 for row in f)
Now you need to create the complementary list (a set is used here; skiprows accepts any list-like collection of row numbers). First create the set of row numbers from 1 to the number of rows, then subtract the numbers of the rows you want to read.
skiplist = set(range(1, row_count+1)) - set(rownumberList)
Finally you can read the csv as normal.
df = pd.read_csv('myfile.csv',skiprows = skiplist)
Here is the full code:
import pandas as pd
with open('myfile.csv') as f:
    row_count = sum(1 for row in f)
rownumberList = [1,2,5,6,8,9,20,22]
skiplist = set(range(1, row_count+1)) - set(rownumberList)
df = pd.read_csv('myfile.csv', skiprows=skiplist)
You could try this:
import pandas as pd
# make a data frame from a csv file
data = pd.read_csv("your_csv_file.csv", index_col="What_you_want")
# retrieve multiple rows by position with iloc
rows = data.iloc[[1,2,5,6,8,9,20,22]]
You will not be able to circumvent the read time when accessing a large file. If you have a very large CSV file, any program will need to read through it at least up to the point where you want to begin extracting rows. Really, that is what databases are designed for.
However, if you want to extract rows 300,000 to 300,123 from a 10,000,000 row CSV file, you are better off reading just the data you need into Python before converting it to a data frame in Pandas. For this you can use the csv module.
import csv
import pandas as pd
start = 300000
stop = start + 123
data = []
with open('/very/large.csv', 'r') as fp:
    reader = csv.reader(fp)
    for i, line in enumerate(reader):
        if i > stop:
            break  # past the last wanted row, so stop reading
        if i >= start:
            data.append(line)
df = pd.DataFrame(data)
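The same start/stop idea can be written more compactly with itertools.islice, which consumes the reader only up to the stop position. A sketch, assuming a half-open range (start included, stop excluded); read_slice is a hypothetical helper name:

```python
import csv
import io
from itertools import islice

def read_slice(lines, start, stop):
    """Return parsed rows with 0-based positions start <= i < stop."""
    return list(islice(csv.reader(lines), start, stop))

# Hypothetical sample: 1000 rows of "row<i>,<i>".
sample = io.StringIO("".join(f"row{i},{i}\n" for i in range(1000)))
rows = read_slice(sample, 300, 305)
print(rows[0], len(rows))
```

The resulting list can be passed to pd.DataFrame in the same way as the data list above.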
for i in range(1, 20):
The first argument is the first row and the second argument is one past the last row, since range excludes its stop value.
Related
For a project I have devices that send payloads, and I should store them in a local file, but I have a memory limitation and I don't want to store more than 2000 data rows. Because of the same memory limitation I cannot have a database, so I chose to store the data in a CSV file.
I tried to use open('output.csv', 'r+') as f: to append rows to the end of my CSV, and I have to check the length each time with sum(1 for line in f) to be sure it is not more than 2000.
The big problem starts when I reach 2000 rows. Ideally I want to delete the first row and add another row to the end, or start writing rows from the beginning of the file and overwrite the old rows without deleting everything, but I don't know how to do it. I tried open('output.csv', 'w+') and open('output.csv', 'a+'), but w+ deletes all the contents while writing only one row, and a+ just continues to append to the end. On the other hand, I cannot count the number of rows anymore with either mode. Can you please tell me which command I should use to start rewriting each line from the beginning, or to delete one line from the beginning and append one to the end? I would also appreciate it if you could tell me whether there is a better choice than CSV files for storing this much data, or a better way to count the number of rows.
This should help. See comments inline
import pandas as pd
allowed_length = 2 # Set it to the required value
df = pd.read_csv('output.csv') #Read your csv file to df
row_count = df.shape[0] #Get row count
df.loc[row_count] = ['Fridge', 15] #Insert row at end of df. In my case it has only 2 values
#if the row count is greater than or equal to allowed_length, then delete the first row
if row_count >= allowed_length:
    df = df.drop(df.head(1).index)
df.to_csv('output.csv', index=False)
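If pandas is too heavy for a memory-limited device, a similar cap can be sketched with only the standard library by using collections.deque with maxlen, which drops the oldest entry automatically. This is an illustrative sketch, not the answer's method: append_capped is a hypothetical helper, and it assumes the file has no header row and is small enough to rewrite on each append.

```python
import csv
import os
import tempfile
from collections import deque

def append_capped(path, new_row, max_rows=2000):
    """Append new_row to the CSV at path, keeping only the newest max_rows rows."""
    rows = deque(maxlen=max_rows)  # oldest rows fall off when the cap is hit
    if os.path.exists(path):
        with open(path, newline="") as f:
            rows.extend(csv.reader(f))
    rows.append(new_row)
    # Rewrite the whole file with the rows that survived the cap.
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)
    return len(rows)

# Usage with a small cap so the effect is visible.
demo_path = os.path.join(tempfile.mkdtemp(), "output.csv")
for i in range(5):
    count = append_capped(demo_path, ["device", i], max_rows=3)
with open(demo_path, newline="") as f:
    kept = list(csv.reader(f))
print(count, kept)
```

The return value doubles as the current row count, so no separate counting pass is needed.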
I have a time series in a big text file.
That file is more than 4 GB.
As it is a time series, I would like to read only 1% of lines.
Desired minimalist example:
df = pandas.read_csv('super_size_file.log',
load_line_percentage = 1)
print(df)
desired output:
>line_number, value
0, 654564
100, 54654654
200, 54
300, 46546
...
I can't resample after loading, because it takes too much memory to load it in the first place.
I may want to load chunk by chunk and resample every chunk. But it seems inefficient to me.
Any ideas are welcome. ;)
Anytime I have to deal with a very large file, I ask "What would Dask do?".
Load the large file as a dask.DataFrame, convert the index to a column (workaround due to full index control not being available), and filter on that new column.
import dask.dataframe as dd
import pandas as pd
nth_row = 100 # grab every nth row from the larger DataFrame
dask_df = dd.read_csv('super_size_file.log') # assuming this file can be read by pd.read_csv
dask_df['df_index'] = dask_df.index
dask_df_smaller = dask_df[dask_df['df_index'] % nth_row == 0]
df_smaller = dask_df_smaller.compute() # to execute the operations and return a pandas DataFrame
This will give you rows 0, 100, 200, etc. from the larger file. If you want to cut down the DataFrame to specific columns, do this before calling compute, i.e. dask_df_smaller = dask_df_smaller[['Signal_1', 'Signal_2']]. You can also call compute with the scheduler='processes' option to use all cores on your CPU.
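For comparison, the same every-nth-row selection can be sketched in plain pandas with a callable skiprows, so unwanted lines are never turned into DataFrame rows. This assumes the log has no header line; the column names line_number and value are taken from the desired output in the question, and the in-memory text stands in for super_size_file.log:

```python
import io
import pandas as pd

# Hypothetical sample log: 1000 lines of "line_number,value".
log_text = "\n".join(f"{i},{i * 2}" for i in range(1000)) + "\n"

nth_row = 100  # keep every nth line

df = pd.read_csv(
    io.StringIO(log_text),
    header=None,
    names=["line_number", "value"],
    # Skip every line whose number is not a multiple of nth_row.
    skiprows=lambda i: i % nth_row != 0,
)
print(df["line_number"].tolist())
```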
You can specify the number of rows you want to read when you use the pandas read_csv function. Here is what you could do:
import pandas as pd
# Select file
infile = 'path/file'
number_of_lines = x  # total number of lines in the file
# Use nrows to choose number of rows (nrows must be an integer)
data = pd.read_csv(infile, nrows=int(number_of_lines * 0.01))
You can also use the chunksize option if you want to read the data chunk by chunk, like you mentioned:
chunksize = 10 ** 6
for chunk in pd.read_csv(filename, chunksize=chunksize):
    process(chunk)
Take a look at Iterating through files chunk by chunk.
It contains an elegant description how to read a CSV file in chunks.
The basic idea is to pass chunksize parameter (No of rows per chunk).
Then, in a loop, you can read this file chunk by chunk.
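Applied to the downsampling question, the chunk loop might look like the sketch below. It relies on the row index continuing across chunks, so a modulo test on the index picks every nth row of the whole file; the value column and the small chunk size are assumptions for the demo:

```python
import io
import pandas as pd

# Hypothetical sample: a header plus 1000 rows holding 0..999.
csv_text = "value\n" + "\n".join(str(i) for i in range(1000)) + "\n"

nth_row = 100
pieces = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=250):
    # Keep only the rows whose global index is a multiple of nth_row.
    pieces.append(chunk[chunk.index % nth_row == 0])

sampled = pd.concat(pieces)
print(sampled["value"].tolist())
```

Only one chunk plus the small sampled pieces are held in memory at any time.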
This should do what you want.
# Select All From CSV File Where
import csv
# Asks for search criteria from user
search_parts = input("Enter search criteria:\n").split(",")
# Opens csv data file
file = csv.reader(open("C:\\your_path\\test.csv"))
# Go over each row and print it if it contains user input.
for row in file:
    if all([x in row for x in search_parts]):
        print(row)
# If you only want to read rows 1,000,000 ... 1,999,999
read_csv(..., skiprows=1000000, nrows=999999)
I'm trying to figure out a way to select only the rows that satisfy my regular expression via Pandas. My actual dataset, data.csv, has one column (the heading is not labeled) and millions of rows. The first four rows look like:
5;4Z13H;;L
5;346;4567;;O
5;342;4563;;P
5;3LPH14;4567;;O
and I wrote the following regular expression
([1-9][A-Z](.*?);|[A-Z][A-Z](.*?);|[A-Z][1-9](.*?);)
which would identify 4Z13H; from row 1 and 3LPH14; from row 4. Basically I would like pandas to filter my data and select rows 1 and 4.
So my desired output would be
5;4Z13H;;L
5;3LPH14;4567;;O
I would then like to save the subset of filter rows into a new csv, filteredData.csv. So far I only have this:
import pandas as pd
import numpy as np
import sys
import re
sys.stdout=open("filteredData.csv","w")
def Process(filename, chunksize):
for chunk in pd.read_csv(filename, chunksize=chunksize):
df[0] = df[0].re.compile(r"([1-9][A-Z]|[A-Z][A-Z]|[A-Z][1-9])(.*?);")
sys.stdout.close()
if __name__ == "__main__":
Process('data.csv', 10 ** 4)
I'm still relatively new to Python, so the code above has some syntax issues (I'm still trying to figure out how to use pandas chunksize). However, the main issue is filtering the rows by the regular expression. I'd greatly appreciate anyone's advice.
One way is to read the CSV as a pandas DataFrame and then use str.contains to create a mask column:
df['mask'] = df[0].str.contains(r'\d+[A-Z]+\d+') #0 is the column name
df = df[df['mask']].drop('mask', axis = 1)
You get the desired DataFrame. If you wish, you can reset the index using df = df.reset_index()
0
0 5;4Z13H;;L
3 5;3LPH14;4567;;O
The second is to read the CSV first, write a new file containing only the filtered rows, and then read that filtered CSV to create the DataFrame:
import csv
import re
import pandas as pd
with open('filteredData.csv', 'r') as f_in:
    with open('filteredData_edit.csv', 'w', newline='') as f_outfile:
        f_out = csv.writer(f_outfile)
        for line in f_in:
            line = line.strip()
            if re.search(r"\d+[A-Z]+\d+", line):
                # write the whole line as a single-column row
                f_out.writerow([line])
df = pd.read_csv('filteredData_edit.csv', header = None)
You get
0
0 5;4Z13H;;L
1 5;3LPH14;4567;;O
From my experience, I would prefer the second method as it would be more efficient to filter out the undesired rows before creating the dataframe.
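Since the question asked about chunksize, the mask approach can also be sketched chunk by chunk, so that only one chunk is in memory at a time. The sample below uses the four rows from the question; the in-memory buffers stand in for data.csv and filteredData.csv, and the tiny chunk size is only for illustration:

```python
import io
import pandas as pd

# The four sample rows from the question, as a headerless one-column CSV.
raw = "5;4Z13H;;L\n5;346;4567;;O\n5;342;4563;;P\n5;3LPH14;4567;;O\n"
pattern = r"\d+[A-Z]+\d+"

out = io.StringIO()  # stands in for open('filteredData.csv', 'w')
for chunk in pd.read_csv(io.StringIO(raw), header=None, chunksize=2):
    # Keep only the rows of this chunk whose text matches the pattern.
    matches = chunk[chunk[0].str.contains(pattern)]
    matches.to_csv(out, header=False, index=False)

filtered = out.getvalue().splitlines()
print(filtered)
```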
I'm "pseudo" creating a .bib file by reading a csv file and then following this structure writing down every thing including newline characters. It's a tedious process but it's a raw form on converting csv to .bib in python.
I'm using Pandas to read csv and write row by row, (and since it has special characters I'm using latin1 encoder) but I'm getting a huge problem: it only reads the first row. From the official documentation I'm using their method on reading row by row, which only gives me the first row (example 1):
row = next(df.iterrows())[1]
But if I remove the next() and [1] it gives me the content of every column concentrated in one field (example 2).
Why is this happening? Why does the method in the docs not iterate through all rows? What would the solution for example 1 look like for all rows?
My code:
import csv
import pandas
import bibtexparser
import codecs
colnames = ['AUTORES', 'TITULO', 'OUTROS', 'DATA','NOMEREVISTA','LOCAL','VOL','NUM','PAG','PAG2','ISBN','ISSN','ISSN2','ERC','IF','DOI','CODEN','WOS','SCOPUS','URL','CODIGO BIBLIOGRAFICO','INDEXAÇÕES',
'EXTRAINFO','TESTE']
data = pandas.read_csv('test1.csv', names=colnames, delimiter =r";", encoding='latin1')#, nrows=1
df = pandas.DataFrame(data=data)
with codecs.open('test1.txt', 'w', encoding='latin1') as fh:
fh.write('@Book{Arp, ')
fh.write('\n')
rl = data.iterrows()
for i in rl:
ix = str(i)
fh.write(' Title = {')
fh.write(ix)
fh.write('}')
fh.write('\n')
PS: I'm new to python and programming, I know this code has flaws and it's not the most effective way to convert csv to bib.
The example row = next(df.iterrows())[1] intentionally only returns the first row.
df.iterrows() returns a generator over tuples describing the rows. The tuple's first entry contains the row index and the second entry is a pandas series with your data of the row.
Hence, next(df.iterrows()) returns the next entry of the generator. If next has not been called before, this is the very first tuple.
Accordingly, next(df.iterrows())[1] returns the first row (i.e. the second tuple entry) as a pandas series.
What you are looking for is probably something like this:
for row_index, row in df.iterrows():
    convert_to_bib(row)
Secondly, all your writing to your file handle fh must happen within the block with codecs.open('test1.txt', 'w', encoding='latin1') as fh:
because at the end of the block the file handle will be closed.
For example:
with codecs.open('test1.txt', 'w', encoding='latin1') as fh:
    # iterate through all rows
    for row_index, row in df.iterrows():
        # iterate through all elements in the row
        for colname in df.columns:
            row_element = row[colname]
            fh.write('%s = {%s},\n' % (colname, str(row_element)))
Still I am not sure if the names of the columns exactly match the bibtex fields you have in mind. Probably you have to convert these first. But I hope you get the principle behind the iterations :-)
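To make the iteration concrete, here is one possible shape for the convert_to_bib helper mentioned above. It is purely illustrative: the entry type, the generated keys (entry0, entry1, ...) and the two-column sample are assumptions, and the column names are used as field names without any mapping to real BibTeX fields.

```python
import pandas as pd

def convert_to_bib(row, key):
    """Format one DataFrame row as a BibTeX-like entry string."""
    fields = ",\n".join(f"  {name} = {{{value}}}" for name, value in row.items())
    return f"@Book{{{key},\n{fields}\n}}\n"

# Hypothetical two-row sample using two of the question's column names.
df = pd.DataFrame({
    "AUTORES": ["Silva, A.", "Costa, B."],
    "TITULO": ["Um Livro", "Outro Livro"],
})

# iterrows() yields (index, Series) pairs, one per row.
entries = [convert_to_bib(row, f"entry{row_index}") for row_index, row in df.iterrows()]
print(entries[0])
```

Each string in entries can then be written to the file handle inside the with block.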
I have a very large data set and I can't afford to read the entire data set in. So, I'm thinking of reading only one chunk of it to train but I have no idea how to do it.
If you only want to read the first 999,999 (non-header) rows:
read_csv(..., nrows=999999)
If you only want to read rows 1,000,000 ... 1,999,999
read_csv(..., skiprows=1000000, nrows=999999)
nrows : int, default None
Number of rows of file to read. Useful for reading pieces of large files
skiprows : list-like or integer
Row numbers to skip (0-indexed) or number of rows to skip (int) at the start of the file
and for large files, you'll probably also want to use chunksize:
chunksize : int, default None
Return TextFileReader object for iteration
pandas.io.parsers.read_csv documentation
chunksize= is a very useful argument because the output of read_csv after passing it is an iterator, so you can call the next() function on it to get the specific chunk you want without straining your memory. For example, to get the first n rows, you can use:
chunks = pd.read_csv('file.csv', chunksize=n)
df = next(chunks)
For example, if you have time-series data and you want to make the first 700k rows the train set and the next chunk (up to another 700k rows) the test set, you can do so by:
chunks = pd.read_csv('file.csv', chunksize=700_000)
train_df = next(chunks)
test_df = next(chunks)
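Note that each call to next() returns a single chunk. If the test set should be everything after the first chunk, the remaining chunks can be concatenated, provided the result fits in memory. A small sketch with a hypothetical ten-row file and a chunk size of 4:

```python
import io
import pandas as pd

# Hypothetical sample: header plus 10 rows of "t,value".
csv_text = "t,value\n" + "\n".join(f"{i},{i * 3}" for i in range(10)) + "\n"

chunks = pd.read_csv(io.StringIO(csv_text), chunksize=4)
train_df = next(chunks)             # first 4 rows
test_df = pd.concat(list(chunks))   # all remaining rows, gathered chunk by chunk
print(len(train_df), len(test_df))
```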
If you do not want to use pandas, you can use the csv library and limit the rows read by breaking out of the iteration.
For example, I needed to read a list of CSV files to get only the header of each one.
for csvs in result:
    csvs = './'+csvs
    with open(csvs,encoding='ANSI', newline='') as csv_file:
        csv_reader = csv.reader(csv_file, delimiter=',')
        count=0
        for row in csv_reader:
            if count:
                break