How can I write only the first N rows, or rows P to Q, from a pandas dataframe to csv without subsetting the df first? I cannot subset the data I want to export because of memory issues.
I am thinking of a function which writes to csv row by row.
Thank you
Use head, which returns the first n rows.
Ex.
import pandas as pd
import numpy as np
date = pd.date_range('20190101',periods=6)
df = pd.DataFrame(np.random.randn(6,4), index=date, columns=list('ABCD'))
# write only the top two rows to a csv file
df.head(2).to_csv("test.csv")
Does this work for you?
df.iloc[:N, :].to_csv()
Or
df.iloc[P:Q, :].to_csv()
I believe df.iloc generally produces references to the original dataframe rather than copying the data.
If this still doesn't work, you might also try setting the chunksize in the to_csv call. It may be that pandas is able to create the subset without using much more memory, but then it makes a complete copy of the rows written to each chunk. If the chunksize is the whole frame, you would end up copying the whole frame at that point and running out of memory.
If all else fails, you can loop through df.iterrows(), df.iloc[P:Q, :].iterrows(), or df.iloc[P:Q, :].itertuples() and write each row using the csv module (possibly writer.writerows(df.iloc[P:Q, :].itertuples())).
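For example, a minimal sketch of the csv-module approach (the frame, P, and Q here are demo stand-ins; itertuples() yields the index as the first element of each tuple):

```python
import csv
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 4), columns=list("ABCD"))
P, Q = 2, 5  # write data rows 2, 3 and 4

with open("subset.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["index"] + list(df.columns))   # header row
    writer.writerows(df.iloc[P:Q, :].itertuples())  # one tuple per row, index first
```

Only the sliced rows pass through memory as tuples, so the full frame is never copied.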
Maybe you can select the row indexes that you want to write to your CSV file, like this:
df[df.index.isin([1, 2, ...])].to_csv('file.csv')
Or use this one:
df.loc[n:n].to_csv('file.csv')
I have a data set of around 400 CSV files containing a time series of multiple variables (my CSV has a time column and then multiple columns of other variables).
My final goal is the choose some variables and plot those 400 time series in a graph.
In order to do so, I tried to use Dask to read the 400 files and then plot them.
However, from my understanding, in order to actually draw 400 time series rather than a single appended data frame, I should group the data by the file name it came from.
Is there any Dask efficient way to add a column to each CSV so I could later groupby my results?
Parquet files are also an option.
For example, I tried to do something like this:
import dask.dataframe as dd
import os
filenames = ['part0.parquet', 'part1.parquet', 'part2.parquet']
df = dd.read_parquet(filenames, engine='pyarrow')
df = df.assign(file=lambda x: filenames[x.index])
df_grouped = df.groupby('file')
I understand that I can use from_delayed(), but then I lose all the parallel computation.
Thank you
If you can work with CSV files, then passing the include_path_column option might be sufficient for your purpose:
from dask.dataframe import read_csv
ddf = read_csv("some_path/*.csv", include_path_column="file_path")
print(ddf.columns)
# the list of columns will include `file_path` column
There is no equivalent option for read_parquet, but something similar can be achieved with delayed. Using delayed does not remove parallelism; the code just needs to ensure that the actual computation happens after the delayed tasks are defined.
I have a big size excel files that I'm organizing the column names into a unique list.
The code below works, but it takes ~9 minutes!
Does anyone have suggestions for speeding it up?
import pandas as pd
import os
get_col = list(pd.read_excel(r"E:\DATA\dbo.xlsx", nrows=1, engine='openpyxl').columns)
print(get_col)
Using pandas to extract just the column names of a large excel file is very inefficient.
You can use openpyxl for this:
from openpyxl import load_workbook
wb = load_workbook(r"E:\DATA\dbo.xlsx", read_only=True)
columns = {}
for sheet in wb.worksheets:
    for value in sheet.iter_rows(min_row=1, max_row=1, values_only=True):
        columns = value
Assuming you only have one sheet, you will get a tuple of column names here.
If you want faster reading, then I suggest you use another file type. Excel files, while convenient, are packed binary files, so for pandas to read and correctly parse one it must load the full file. Using nrows or skipfooter to work with less data only takes effect after the full file has been loaded, and therefore doesn't really reduce the waiting time. By contrast, when working with a .csv file, given its type and the absence of significant metadata, you can extract just the first rows as an iterable using the chunksize parameter in pd.read_csv().
Other than that, calling list() on a dataframe already returns a list of its columns. So my only suggestion for the code you use is:
get_col = list(pd.read_excel(r"E:\DATA\dbo.xlsx", nrows=1, engine='openpyxl'))
The stronger suggestion is to change datatype if you specifically want to address this issue.
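For comparison, with a CSV you can grab just the header without loading any data at all; a quick sketch (the demo file is created on the spot):

```python
import pandas as pd

# build a small demo file; with a real file, only the header line is parsed
pd.DataFrame({"A": [1], "B": [2], "C": [3]}).to_csv("demo.csv", index=False)

get_col = list(pd.read_csv("demo.csv", nrows=0))  # nrows=0 reads the header only
```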
I am trying to populate an empty dataframe by using the csv module to iterate over a large tab-delimited file, and replacing each row in the dataframe with these values. (Before you ask, yes I have tried all the normal read_csv methods, and nothing has worked because of dtype issues and how large the file is).
I first made an empty numpy array using np.empty, using the dimensions of my data. I then converted this to a pandas DataFrame. Then, I did the following:
with open(input_file) as csvfile:
    reader = csv.DictReader(csvfile, delimiter='\t')
    row_num = 0
    for row in reader:
        for key, value in row.items():
            df.loc[row_num, key] = value
        row_num += 1
This is working great, except that my file has 900,000 columns, so it is unbelievably slow. This also feels like something that pandas could do more efficiently, but I've been unable to find how. The dictionary for each row given by DictReader looks like:
{'columnName1':<value>,'columnName2':<value> ...}
Where the values are what I want to put in the dataframe in those columns for that row.
Thanks!
What you could do in this case is build smaller chunks of your big CSV data file. I had the same issue with a 32 GB CSV file, so I had to read it in chunks. After reading them in, you can work with them.
# read the large csv file with specified chunksize
df_chunk = pd.read_csv(r'../input/data.csv', chunksize=1000000)
chunksize=1000000 sets how many rows are read in at once.
Helpful website:
https://towardsdatascience.com/why-and-how-to-use-pandas-with-large-data-9594dda2ea4c
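The object returned when chunksize is set is an iterator, so each piece can be reduced in turn instead of holding everything in memory; a minimal sketch with a small demo file standing in for the large one:

```python
import pandas as pd

# small demo file standing in for the large CSV
pd.DataFrame({"a": range(10), "b": range(10)}).to_csv("big.csv", index=False)

results = []
for chunk in pd.read_csv("big.csv", chunksize=4):  # 4 rows at a time
    results.append(chunk["a"].sum())               # reduce each chunk immediately

total = sum(results)  # combine the per-chunk results
```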
This is another follow up to an earlier question I posted How can I merge these many csv files (around 130,000) using PySpark into one large dataset efficiently?
I have the following dataset https://fred.stlouisfed.org/categories/32263/downloaddata/INTRNTL_csv_2.zip
In it, there's a list of files (around 130,000) in the main directory, with their sub-directories listed; so the first cell might be A/AAAAA, and that file would be located at /data/A/AAAAA.csv.
The files all have a similar format: the first column is called DATE and the second column is a series, always named VALUE. So, first of all, the VALUE column in each csv file needs to be renamed to that file's name. Second, the frames need to be full outer joined with each other, with DATE as the main index. Third, I want to save the result and be able to load and manipulate it. The file should be roughly N rows (number of dates) x 130,001 columns.
I am trying to full outer join all the files into a single dataframe, I previously tried pandas but ran out of memory when trying to concat the list of files and someone recommended that I try to use PySpark instead.
In a previous post I was told that I could do this:
df = spark.read.csv("/kaggle/input/bf-csv-2/BF_csv_2/data/**/*.csv", "date DATE, value DOUBLE")
But all the columns are named VALUE, and the frame just becomes two columns: the first is DATE and the second is VALUE. It loads quite fast, around 38 seconds, and is around 3.8 million values by 2 columns, so I know it's not doing the full outer join; it's appending the files row-wise.
So I tried the following code:
import pandas as pd
import time
import os
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('spark-dataframe-demo').getOrCreate()
from pyspark.sql import *
from pyspark.sql.functions import col
from pyspark.sql import DataFrame
from pyspark.sql.types import *
filelist = pd.read_excel("/kaggle/input/list/BF_csv_2.xlsx") #list of filenames
firstname = min(filelist.File)
length = len(filelist.File)
dff = spark.read.csv(f"/kaggle/input/bf-csv-2/BF_csv_2/data/" + firstname, inferSchema = True, header = True).withColumnRenamed("VALUE",firstname) #read file and changes name of column to filename
for row in filelist.File.items():
    if row == firstname:
        continue
    print(row[1], length, end='', flush=True)
    df = spark.read.csv("/kaggle/input/bf-csv-2/BF_csv_2/data/" + row[1], inferSchema=True, header=True).withColumnRenamed("VALUE", row[1][:-4])
    #df = df.select(col("DATE").alias("DATE"), col("VALUE").alias(row[1][:-4]))
    dff = dff.join(df, ['DATE'], how='full')
    length -= 1
dff.write.save('/kaggle/working/whatever', format='parquet', mode='overwrite')
To test it, I call df.show() after 3 columns are merged, and it's quite fast. But when I try around 25 columns, it takes around 2 minutes. When I try 500 columns, it's next to impossible.
I don't think I'm doing it right. The formatting and everything is correct. But why is it taking so long? How can I use PySpark properly? Are there any better libraries to achieve what I need?
Spark doesn't do anything magical compared to other software. Spark's strength is parallel processing; most of the time, that means you can use multiple machines to do the work. If you are running Spark locally, you may have the same issues you did when using pandas.
That being said, there might be a way for you to run it locally using Spark because it can spill to disk under certain conditions and does not need to have everything in memory.
I'm not versed in PySpark, but the approach I'd take is:
Load all the files using a glob, like you did: /kaggle/input/bf-csv-2/BF_csv_2/data/**/*.csv
Use the input_file_name function from pyspark.sql.functions, which gives you the path for each record in your DF (df.select("date", "value", input_file_name().alias("filename")) or similar)
Parse the path into a format you'd like to have as a column (e.g. extract the filename)
The schema should look like date, value, filename at this step
Use the PySpark equivalent of df.groupBy("date").pivot("filename").agg(first("value")). Note: I used first() because I think you have 1 or 0 records per date/file pair
Also try: setting the number of partitions equal to the number of dates you have
If you want the output as a single file, do not forget to repartition(1) before df.write. This step might be problematic depending on data size. You do not need it if you plan to keep using Spark for your work, since you could load the data using the same approach as in step 1 (/new_result_data/*.csv)
It seems that you can look at columns in a file no problem, but there's no apparent way to look at rows. I know I can read the entire file (CSV or excel) into a crazy huge dataframe in order to select rows, but I'd rather be able to grab particular rows straight from the file and store those in a reasonably sized dataframe.
I do realize that I could just transpose/pivot the df before saving it to the aforementioned CSV/Excel file. This would be a problem for Excel because I'd run out of columns (the transposed rows) far too quickly. I'd rather use Excel than CSV.
My original, not transposed data file has 9000+ rows and 20ish cols. I'm using Excel 2003 which supports up to 256 columns.
EDIT: Figured out a solution that works for me. It's a lot simpler than I expected. I did end up using CSV instead of Excel (I found no serious difference for my project). Here it is for whoever may have the same problem:
import pandas as pd
selectionList = (2, 43, 792, 4760)  # rows to select
df = pd.read_csv(your_csv_file, index_col=0).T
selection = {}
for item in selectionList:
    selection[item] = df[item]
selection = pd.DataFrame.from_dict(selection)
selection.T.to_csv(your_path)
I think you can use the skiprows and nrows arguments in pandas.read_csv to pick out individual rows to read in.
With skiprows, you can provide a long list (0-indexed) of rows not to import, e.g. [0,5,6,10]. That might end up being a huge list, though. If you provide it a single integer, it will skip that number of rows and start reading. Set nrows to however many rows you want to pick up from the point where you have it start.
If I've misunderstood the issue, let me know.
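In recent pandas, skiprows also accepts a callable, which avoids building a huge list; a sketch selecting data rows P through Q inclusive while keeping the header (the demo file and bounds are stand-ins):

```python
import pandas as pd

# small demo file; data rows 0-9
pd.DataFrame({"x": range(10)}).to_csv("rows.csv", index=False)

P, Q = 3, 6                         # keep data rows 3..6
wanted = set(range(P + 1, Q + 2))   # shift by 1 because line 0 is the header

# the callable gets each line number and returns True to skip it
df = pd.read_csv("rows.csv", skiprows=lambda i: i != 0 and i not in wanted)
```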