Dataframe instance management in Python - python

I recently worked on a project parsing CSV files of cable modem MAC address (CMMAC) data, which made it useful to work with dataframes through the Pandas module. One of the problems I encountered related to the overall approach to and structure of the dataframes themselves. Specifically, I was concerned with having to create a new dataframe instance for each action performed on the data. I did not feel that having to invoke "df1", "df2", "df3", etc. was an efficient way of writing Python.
Below is a segment of the code where I had to instantiate the dataframes for different actions. The sample files (file1.csv and file2.csv) are identical and posted below as well.
file1.csv and file2.csv
cmmac,match
AABBCCDDEEFF,true
001122334455,false
001122334455,false
Python script:
import os
import glob
from functools import partial
import pandas as pd
#read and concatenate all CSV files in working directory
df1 = pd.concat(map(partial(pd.read_csv, header=0), glob.glob(os.path.join('', "*.csv"))))
#sort by column labeled "cmmac"
df2 = df1.sort_values(by='cmmac')
#delete any duplicate records
df3 = df2.drop_duplicates()
#convert MAC address format to colon notation (e.g. 001122334455 to 00:11:22:33:44:55)
df3['cmmac'] = df3['cmmac'].apply(lambda x: ':'.join(x[i:i+2] for i in range(0, len(x), 2)))
Additional actions were performed on the data in the CSV files, and by the end I had thirteen dataframes (df13). With more complex projects I would have been in a death spiral of dataframes using this method.
My question is: how should dataframes be managed in order to avoid using this many instances? If I need to drop a column or rearrange the columns, does each of those actions require a new dataframe? In "df1" I was able to combine two distinct actions, reading in all the CSV files and concatenating them, but I could not add further actions, and even so that line would eventually become difficult to read. What approach have you adopted when working with dataframes across many smaller tasks? Thanks.
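For reference, one common way to avoid the numbered intermediates is to reuse a single name and chain the calls so each step feeds the next. Below is a minimal sketch of the same pipeline written as one chain; the assign step and the format_mac helper are illustrative rewrites of the apply line above, not part of the original script.
import os
import glob
from functools import partial
import pandas as pd

def format_mac(mac):
    # 001122334455 -> 00:11:22:33:44:55
    return ':'.join(mac[i:i+2] for i in range(0, len(mac), 2))

df = (
    pd.concat(map(partial(pd.read_csv, header=0),
                  glob.glob(os.path.join('', "*.csv"))))
      .sort_values(by='cmmac')
      .drop_duplicates()
      .assign(cmmac=lambda d: d['cmmac'].apply(format_mac))
)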

Related

Adding file name column to Dask DataFrame

I have a data set of around 400 CSV files containing a time series of multiple variables (my CSV has a time column and then multiple columns of other variables).
My final goal is to choose some variables and plot those 400 time series in a graph.
In order to do so, I tried to use Dask to read the 400 files and then plot them.
However, from my understanding, in order to actually draw 400 time series rather than a single appended data frame, I should group the data by the file name it came from.
Is there any efficient Dask way to add a column to each CSV so I could later groupby my results?
Parquet files are also an option.
For example, I tried to do something like this:
import dask.dataframe as dd
import os
filenames = ['part0.parquet', 'part1.parquet', 'part2.parquet']
df = dd.read_parquet(filenames, engine='pyarrow')
df = df.assign(file=lambda x: filenames[x.index])
df_grouped = df.groupby('file')
I understand that I can use from_delayed(), but then I lose all the parallel computation.
Thank you
If you can work with CSV files, then passing the include_path_column option might be sufficient for your purpose:
from dask.dataframe import read_csv
ddf = read_csv("some_path/*.csv", include_path_column="file_path")
print(ddf.columns)
# the list of columns will include `file_path` column
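With that column in place, ddf.groupby("file_path") then gives one group per source file.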
There is no equivalent option for read_parquet, but something similar can be achieved with delayed. Using delayed does not remove parallelism; the code just needs to make sure that the actual computation is triggered after the delayed tasks are defined.
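A minimal sketch of that delayed approach, assuming the same three example parquet files from the question; the helper name load_with_name is made up for illustration:
import pandas as pd
import dask.dataframe as dd
from dask import delayed

filenames = ['part0.parquet', 'part1.parquet', 'part2.parquet']

@delayed
def load_with_name(path):
    # read one parquet file and tag every row with its source file
    pdf = pd.read_parquet(path, engine='pyarrow')
    pdf['file'] = path
    return pdf

ddf = dd.from_delayed([load_with_name(f) for f in filenames])
df_grouped = ddf.groupby('file')  # still lazy, still parallel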

Split a spark dataframe into multiple frames and write as CSV

I have a use case where in I am reading data from a source into a dataframe, doing a groupBy on a field and essentially breaking that dataframe into an array of dataframes.
My target state is to have all these dataframes written as individual CSV files in S3 (CSV because they need to be downloaded by the client and need to be human readable).
What's the best way of going about this?
I used this to split df into df_array : df_array = [(df.where(df[column_name] == i),i) for i in distinct_values]
And I used df.toPandas().to_csv(output_path + '.csv', index=False) on the individual dataframes to convert them to CSV files - but the challenges faced with this approach are:
My understanding is that since I require a single CSV file per grouping field, to_csv will bring data from all worker nodes to the driver and may cause a driver OOM issue.
I am unable to use Python multiprocessing to write the individual dataframes to S3, since the data is distributed on worker nodes and it gives me an error: Spark: Broadcast variables: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation.
No space left on device.
The pipeline is pretty slow as well. What is a better way to approach this use case?
[EDIT]
I want to control the name of the CSV file which gets created as well. The target state is 1 CSV file per group-by field (let's call that Name), so if there are 10 different Names in my initial df, the output will be 10 CSV files, each titled Name1.csv, Name2.csv, and so on.
As you're using PySpark, why don't you use repartition and partitionBy to achieve your goal?
df.repartition(1).write.partitionBy('grouping_field1', 'grouping_field2', ...).save('/path/to/save', format='csv')
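Note that partitionBy writes one sub-directory per value of the grouping field, with Spark-generated part-* file names inside, so exact file names like Name1.csv would still require renaming the outputs afterwards. As an alternative, here is a rough sketch of the per-group loop described in the question, using Spark's native CSV writer so the data is never collected on the driver; the bucket path and the column name "Name" are assumptions:
# one output directory per group; Spark still names the file inside part-*.csv
distinct_names = [r["Name"] for r in df.select("Name").distinct().collect()]
for name in distinct_names:
    (df.where(df["Name"] == name)
       .coalesce(1)                     # single part file per group
       .write.mode("overwrite")
       .csv(f"s3://my-bucket/output/{name}", header=True))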

Using PySpark to efficiently combine many small csv files (130,000 with 2 columns in each) into one large frame

This is another follow up to an earlier question I posted How can I merge these many csv files (around 130,000) using PySpark into one large dataset efficiently?
I have the following dataset https://fred.stlouisfed.org/categories/32263/downloaddata/INTRNTL_csv_2.zip
In it, there is a list of around 130,000 files. The list gives each file's sub-directory within the main directory, so the first cell might be A/AAAAA, and the file would be located at /data/A/AAAAA.csv.
The files all have a similar format: the first column is called DATE and the second column is a series, always named VALUE. So first of all, the VALUE column needs to be renamed to the file name in each CSV file. Second, the frames need to be full outer joined with each other with DATE as the main index. Third, I want to save the file and be able to load and manipulate it. The result should be roughly N rows (the number of dates) by 130,001 columns.
I am trying to full outer join all the files into a single dataframe. I previously tried pandas but ran out of memory when trying to concat the list of files, and someone recommended that I try PySpark instead.
In a previous post I was told that I could do this:
df = spark.read.csv("/kaggle/input/bf-csv-2/BF_csv_2/data/**/*.csv", "date DATE, value DOUBLE")
But all the columns are named VALUE and the frame just becomes two columns: the first is DATE and the second is VALUE. It loads quite fast, around 38 seconds, at roughly 3.8 million values by 2 columns, so I know it's not doing the full outer join; it's appending the files row-wise.
So I tried the following code:
import pandas as pd
import time
import os
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('spark-dataframe-demo').getOrCreate()
from pyspark.sql import *
from pyspark.sql.functions import col
from pyspark.sql import DataFrame
from pyspark.sql.types import *
filelist = pd.read_excel("/kaggle/input/list/BF_csv_2.xlsx") #list of filenames
firstname = min(filelist.File)
length = len(filelist.File)
dff = spark.read.csv("/kaggle/input/bf-csv-2/BF_csv_2/data/" + firstname, inferSchema=True, header=True).withColumnRenamed("VALUE", firstname)  # read first file and rename its VALUE column to the filename
for row in filelist.File.items():
    if row[1] == firstname:
        continue
    print(row[1], length, end='', flush=True)
    df = spark.read.csv("/kaggle/input/bf-csv-2/BF_csv_2/data/" + row[1], inferSchema=True, header=True).withColumnRenamed("VALUE", row[1][:-4])
    #df = df.select(col("DATE").alias("DATE"), col("VALUE").alias(row[1][:-4]))
    dff = dff.join(df, ['DATE'], how='full')
    length -= 1
dff.write.save('/kaggle/working/whatever', format='parquet', mode='overwrite')
So to test it, I call df.show() after 3 columns are merged and it's quite fast. But when I try around 25 columns, it takes around 2 minutes. When I try 500 columns, it's next to impossible.
I don't think I'm doing it right. The formatting and everything is correct. But why is it taking so long? How can I use PySpark properly? Are there any better libraries to achieve what I need?
Spark doesn't do anything magical compared to other software. The strength of Spark is parallel processing. Most of the time that means you can use multiple machines to do the work. If you are running Spark locally you may have the same issues you did when using pandas.
That being said, there might be a way for you to run it locally using Spark, because it can spill to disk under certain conditions and does not need to have everything in memory.
I'm not versed in PySpark, but the approach I'd take is (a rough sketch follows the list):
load all the files at once with a glob, like you did: /kaggle/input/bf-csv-2/BF_csv_2/data/**/*.csv
use input_file_name from pyspark.sql.functions, which gives you the path for each record in your DF (df.select("date", "value", input_file_name().alias("filename")) or similar)
parse the path into the format you'd like to have as a column (e.g. extract the filename)
the schema should look like date, value, filename at this step
use df.groupBy("date").pivot("filename").agg(first("value")) or its PySpark equivalent. Note: I used first() because I think you have at most 1 record per date and file
Also try: setting the number of partitions to be equal to the number of dates you have.
If you want the output as a single file, do not forget to repartition(1) before df.write. This step might be problematic depending on data size. You do not need to do this if you plan to keep using Spark for your work, as you could load the data using the same approach as in step 1 (/new_result_data/*.csv).
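A rough sketch of those steps, assuming the Kaggle paths from the question and a simple rule for extracting the series name from the file path:
from pyspark.sql import SparkSession
from pyspark.sql.functions import input_file_name, regexp_extract, first

spark = SparkSession.builder.appName('combine-csvs').getOrCreate()

# step 1: read every CSV at once; the DDL schema string mirrors the question
df = spark.read.csv("/kaggle/input/bf-csv-2/BF_csv_2/data/**/*.csv", schema="DATE DATE, VALUE DOUBLE", header=True)

# steps 2-3: derive a series name from the path, e.g. ".../A/AAAAA.csv" -> "AAAAA"
df = df.withColumn("filename", regexp_extract(input_file_name(), r"([^/]+)\.csv$", 1))

# step 5: one row per DATE, one column per series
# (with ~130,000 distinct filenames, spark.sql.pivotMaxValues may need to be raised)
wide = df.groupBy("DATE").pivot("filename").agg(first("VALUE"))
wide.write.parquet("/kaggle/working/combined", mode="overwrite")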

Looking for a way to overcome 'MemoryError' in Spyder while merging dataframes together

I am running the script below.
import numpy as np
import pandas as pd
# load all data to respective dataframes
orders = pd.read_csv('C:\\my_path\\orders.csv')
products = pd.read_csv('C:\\my_path\\products.csv')
order_products = pd.read_csv('C:\\my_path\\order_products.csv')
# check out data sets
print(orders.shape)
print(products.shape)
print(order_products.shape)
# merge different dataframes into one consolidated dataframe
df = pd.merge(order_products, products, on='product_id')
df = pd.merge(df, orders, on='order_id')
On the last line of merging the second data frame, I get this result:
out = np.empty(out_shape, dtype=dtype)
MemoryError
The file named 'order_products.csv' is around 550MB, 'orders.csv' is 100MB, and 'products.csv' is just 2MB. I have tried running this process a few times and I always get the MemoryError issue. It doesn't seem like the files are really, really massive, but I guess it's all relative, because on my old machine, it's just too much. Is there a simple way to read these files into dataframes in chunks and then merge these together in chunks?
I am working with Spyder 3.3.4, Python 3.7, and Windows 7 on an old ThinkPad.
Thanks.
Try to use the concept of slicing and chunking. What the error says is that you've reached the limit of what your computer's RAM can hold.
orders_100 = orders[:100]
products_100 = products[:100]
order_products_100 = order_products[:100]
Then do pd.merge()
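A minimal sketch of chunked reading with pandas, assuming the same file paths as in the question; the chunk size is arbitrary, and this only helps if the final merged result itself fits in memory:
import pandas as pd

# the two smaller tables fit in memory as-is
products = pd.read_csv('C:\\my_path\\products.csv')
orders = pd.read_csv('C:\\my_path\\orders.csv')

# stream the large file in chunks and merge each chunk separately
pieces = []
for chunk in pd.read_csv('C:\\my_path\\order_products.csv', chunksize=200_000):
    piece = chunk.merge(products, on='product_id').merge(orders, on='order_id')
    pieces.append(piece)

df = pd.concat(pieces, ignore_index=True)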

Python read in multiple .txt files and row bind using pandas

I'm coming from R (and SAS) and am having an issue reading in a large set of .txt files (all stored in the same directory) and creating one large dataframe in pandas. So far I have attempted an amalgamation of code, all of which fails miserably. I assume this is a simple task but lack the experience in Python...
If it helps this is the data I would like to create one large dataframe with: http://www.ssa.gov/oact/babynames/limits.html
- the state-specific sets (50 in total, each named <state abbreviation>.txt)
Please help!
import pandas as pd
import glob
filelist = glob.glob(r"C:\Users\Dell\Downloads\Names\*.txt")  # raw string so the backslashes are not treated as escape sequences
names = ['state', 'gender', 'year', 'name', 'count']
Then, I was thinking of using pd.concat, but am not sure - essentially I want to read in each dataset and then row.bind the sets together (given they all have the same columns)...
concat is nice since "join" is set to "outer" (i.e. the union of the indexes) by default. You could just as easily use df.join(), but you must specify "how" as "outer". Either way, you can build a dataframe quite simply:
import pandas as pd
from glob import glob as gg

names = ['state', 'gender', 'year', 'name', 'count']
frames = []
for f in gg('*.txt'):
    # the SSA state files have no header row, so pass the column names explicitly
    frames.append(pd.read_csv(f, names=names, header=None))
data = pd.concat(frames, axis=0, ignore_index=True)
