It seems that you can look at columns in a file no problem, but there's no apparent way to look at rows. I know I can read the entire file (CSV or excel) into a crazy huge dataframe in order to select rows, but I'd rather be able to grab particular rows straight from the file and store those in a reasonably sized dataframe.
I do realize that I could just transpose/pivot the df before saving it to the aforementioned CSV/Excel file. This would be a problem for Excel because I'd run out of columns (the transposed rows) far too quickly. I'd rather use Excel than CSV.
My original, untransposed data file has 9000+ rows and about 20 columns. I'm using Excel 2003, which supports up to 256 columns.
EDIT: Figured out a solution that works for me. It's a lot simpler than I expected. I did end up using CSV instead of Excel (I found no serious difference for my project). Here it is for whoever may have the same problem:
import pandas as pd

selectionList = (2, 43, 792, 4760)  # rows to select

# Read the file with the first column as the index, then transpose so the
# original rows become columns and can be selected by label.
df = pd.read_csv(your_csv_file, index_col=0).T

selection = {}
for item in selectionList:
    selection[item] = df[item]

# Assemble the selected rows into a DataFrame, transpose back, and save.
selection = pd.DataFrame.from_dict(selection)
selection.T.to_csv(your_path)
I think you can use the skiprows and nrows arguments in pandas.read_csv to pick out individual rows to read in.
With skiprows, you can pass it a (0-indexed) list of rows not to import, e.g. [0, 5, 6, 10], though that might end up being a huge list. If you pass it a single integer instead, it will skip that many rows from the top and start reading there. Set nrows to the number of rows you want to pick up from the point where you have it start.
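For example, a minimal sketch (the file name is a placeholder, not from your question) that keeps the header row but reads only data rows 101-105:

import pandas as pd

# Keep the header (row 0), skip data rows 1-100, then read the next 5 rows.
subset = pd.read_csv("data.csv", skiprows=range(1, 101), nrows=5)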
If I've misunderstood the issue, let me know.
I have a large Excel file whose column names I'm organizing into a unique list.
The code below works, but it takes ~9 minutes!
Does anyone have suggestions for speeding it up?
import pandas as pd

get_col = list(pd.read_excel(r"E:\DATA\dbo.xlsx", nrows=1, engine='openpyxl').columns)
print(get_col)
Using pandas to extract just the column names of a large excel file is very inefficient.
You can use openpyxl for this:
from openpyxl import load_workbook

wb = load_workbook(r"E:\DATA\dbo.xlsx", read_only=True)

columns = ()
for sheet in wb.worksheets:
    # Read only the first row of each sheet; values_only yields a tuple of cell values.
    for value in sheet.iter_rows(min_row=1, max_row=1, values_only=True):
        columns = value
Assuming you only have one sheet, you will get a tuple of column names here.
If you want faster reading, then I suggest using a different file type. Excel files, while convenient, are binary, so for pandas to read and correctly parse one it must load the full file; using nrows or skipfooter to work with less data only takes effect after the full data is loaded, and therefore doesn't really reduce the waiting time. By contrast, with a .csv file, given its type and the lack of significant metadata, you can extract just the first rows as an iterable using the chunksize parameter in pd.read_csv().
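For instance, a minimal sketch (assuming the data has been exported to a CSV; the path is hypothetical) that materializes only the first chunk to get the column names without loading the whole file:

import pandas as pd

# Only the first chunk is read, so the rest of the file is never loaded.
reader = pd.read_csv(r"E:\DATA\dbo.csv", chunksize=1)
first_chunk = next(iter(reader))
print(list(first_chunk.columns))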
Other than that, calling list() on a DataFrame already returns a list of its column names, so my only suggestion for the code you use is:
get_col = list(pd.read_excel(r"E:\DATA\dbo.xlsx", nrows=1, engine='openpyxl'))
The stronger suggestion is to change the file type if you specifically want to address this issue.
I'm trying to read a rather large CSV (2 GB) with pandas to do some datatype manipulation and joining with other dataframes I have already loaded. As I want to be a little careful with memory, I decided to read it in chunks. For the purpose of the question, here is an extract of my CSV layout with dummy data (can't really share the real data, sorry!):
institution_id,person_id,first_name,last_name,confidence,institution_name
1141414141,4141414141,JOHN,SMITH,0.7,TEMP PLACE TOWN
10123131114,4141414141,JOHN,SMITH,0.7,TEMP PLACE CITY
1003131313188,4141414141,JOHN,SMITH,0.7,"TEMP PLACE,TOWN"
18613131314,1473131313,JOHN,SMITH,0.7,OTHER TEMP PLACE
192213131313152,1234242383,JANE,SMITH,0.7,"OTHER TEMP INC, LLC"
My pandas code to read the file:
inst_map = pd.read_csv("data/hugefile.csv",
                       engine="python",
                       chunksize=1000000,
                       index_col=False)

print("processing institution chunks")
chunk_list = []  # append each chunk df here
for chunk in inst_map:
    # perform data filtering
    chunk['person_id'] = chunk['person_id'].progress_apply(zip_check)
    chunk['institution_id'] = chunk['institution_id'].progress_apply(zip_check)
    # Once the data filtering is done, append the chunk to the list
    chunk_list.append(chunk)

ins_processed = pd.concat(chunk_list)
The zip check function that I'm applying is basically performing some datatype checks and then converting the value that it gets into an integer.
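For reference, a rough sketch of what such a function might look like; the real zip_check isn't shown in the question, so this is purely illustrative:

def zip_check(value):
    # Illustrative stand-in for the question's zip_check: validate the value
    # and coerce it to an integer, falling back to 0 on bad input.
    try:
        return int(value)
    except (TypeError, ValueError):
        return 0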
Whenever I read the CSV, it only ever reads the institution_id column and generates an index. The other columns in the CSV are just silently dropped.
When I don't use index_col=False as an option, it just sets 1141414141/4141414141/JOHN/SMITH/0.7 (basically the first 5 values in the row) as the index, takes only institution_id as the header, and reads only institution_name into the dataframe as a value.
I have honestly no clue what is going on here, and after 2 hours of SO / google search I decided to just ask this as a question. Hope someone can help me, thanks!
The issue turned out to be that something went wrong while transferring the large CSV file to my remote processing server (which has sufficient RAM to handle in-memory editing). Processing the chunks on my local computer seems to work.
After reuploading the file it worked fine on the remote server.
I have a data set that is more than 100 MB in size and spread across many files. These files have more than 20 columns and more than 1 million rows.
The main problems with the data are:
Headers are repeating -- duplicate header rows
Rows duplicated in full, i.e. the data in every column of that particular row is a duplicate
Regardless of which or how many columns are involved, I only need to keep the first occurrence and remove the rest.
I did find plenty of examples, but what I'm looking for is for the input and output to be the same file. The only reason I'm seeking help is that I want the same file to be edited in place.
Sample input:
https://www.dropbox.com/s/sl7y5zm0ppqfjn6/sample_duplicate.csv?dl=0
I appreciate the help, thanks in advance.
If the number of duplicate headers is known and constant, skip those rows:
csv = pd.read_csv('https://www.dropbox.com/s/sl7y5zm0ppqfjn6/sample_duplicate.csv?dl=1', skiprows=4)
Alternatively, with the bonus of removing all duplicate rows (based on all columns), do this:
csv = pd.read_csv('https://www.dropbox.com/s/sl7y5zm0ppqfjn6/sample_duplicate.csv?dl=1')
csv = csv.drop_duplicates()
Now you still have a header line left in the data; just skip it:
csv = csv.iloc[1:]
You certainly can then overwrite the input file with pandas.DataFrame.to_csv.
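For example, assuming you want to write back to a local copy of the file (the file name is an assumption):

# Overwrite the original file; index=False keeps pandas from adding a row-index column.
csv.to_csv("sample_duplicate.csv", index=False)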
Suppose I have a csv file with 400 columns. I cannot load the entire file into a DataFrame (won't fit in memory). However, I only really want 50 columns, and this will fit in memory. I don't see any built in Pandas way to do this. What do you suggest? I'm open to using the PyTables interface, or pandas.io.sql.
The best-case scenario would be a function like: pandas.read_csv(...., columns=['name', 'age',...,'income']). I.e. we pass a list of column names (or numbers) that will be loaded.
Ian, I implemented a usecols option which does exactly what you describe. It will be in upcoming pandas 0.10; development version will be available soon.
Since 0.10, you can use usecols like
df = pd.read_csv(...., usecols=['name', 'age',..., 'income'])
There's no default way to do this right now. I would suggest chunking the file and iterating over it and discarding the columns you don't want.
So something like pd.concat([x.ix[:, cols_to_keep] for x in pd.read_csv(..., chunksize=200)])
I have been trying to load a large-ish file (~480MB, 5,250,000 records, stock price daily data - dt, o, h, l, c, v, val, adj, fv, sym, code - for about 4,500 instruments) into pandas using read_csv. It runs fine and creates the DataFrame. However, on conversion to a Panel, the values for several stocks are way off, and nowhere close to the values in the original csv file.
I then attempted to use the chunksize parameter in read_csv, and used a for loop to:
reader = read_csv("bigfile.csv", index_col=[0, 9], parse_dates=True,
                  names=['n1', 'n2', ..., 'nn'], chunksize=100000)
new_df = DataFrame(reader.get_chunk(1))
for chunk in reader:
    new_df = concat([new_df, chunk])  # concat expects a list of frames
This reads in the data, but:
I get the same erroneous values (edit: when converting to a Panel)
It takes ages longer than the plain read_csv (no iterator)
Any ideas how to get around this?
Edit:
Changed the question to reflect the problem - the dataframe is fine; conversion to a Panel is the problem. The error appears even after splitting the input csv file, merging, and then converting to a panel. If I maintain a multi-index DataFrame, there is no problem and the values are represented correctly.
Some bugs have been fixed in the DataFrame to Panel code. Please try with the latest pandas version (preferably upcoming 0.10) and let us know if you're still having issues.
If you know a few specific values that are off, you might just examine those lines specifically in your csv file. You should also check out the docs on csv, particularly in terms of dialects and the Sniffer class. You might be able to find some setting that will correctly detect how the file is delimited.
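For example, a quick sketch that lets the standard library's Sniffer guess the dialect from a sample of the file:

import csv

# Guess the dialect (delimiter, quoting, etc.) from the first few KB of the file.
with open("bigfile.csv", newline="") as f:
    dialect = csv.Sniffer().sniff(f.read(4096))
print(dialect.delimiter)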
If you find that the errors go away when you look only at specific lines, that probably means that there is an erroneous/missing line break somewhere that is throwing things off.
Finally, if you can't seem to find patterns of correct/incorrect lines, you might try (randomly or otherwise) selecting a subset of the lines in your csv file and see whether the error is occurring because of the size of the file (I'd guess this would be unlikely, but I'm not sure).
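As a sketch of that last idea (the 10% sampling rate is arbitrary), skiprows also accepts a callable, so you can read a random subset of rows:

import random
import pandas as pd

# Read roughly 10% of the rows, chosen at random, to test whether the
# problem depends on the size of the file.
sample = pd.read_csv("bigfile.csv", skiprows=lambda i: random.random() > 0.1)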