After writing a PySpark DataFrame to CSV, half of the rows are missing - python

I tried to create sample data in a Spark DataFrame and write the resulting sample to a CSV file. The DataFrame has 1000000 records, but once I wrote it to a CSV file, only 462690 rows had been written.
I've tried different options in the write method, but the problem remains. I even tried converting the Spark DataFrame to a pandas DataFrame, and the same problem happens in that case too.
df.write.options(header='True', delimiter=',', quoteAll=True) \
    .csv("PresSample2.csv")
or
df2 = df.select("*").toPandas()
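A quick sanity check (just a sketch, not a fix) is to read the written output back with Spark and compare counts. Note that .csv() writes a directory of part files, so the whole "PresSample2.csv" directory should be read, not a single part file; df and spark are the objects from the question:
written = spark.read.csv("PresSample2.csv", header=True)
print(df.count(), written.count())
# If any field contains embedded newlines, the reader may also need
# multiLine=True to reassemble quoted multi-line records into single rows.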

Related

Loading csv into pandas dataframe only reads 6 columns

I'm trying to read a csv file into a dataframe. The original csv has about 111 columns (not all needed; I want to drop some columns later), but on loading the data, the resulting dataframe only has 6 columns, though it has the correct number of rows (~300).
I have double checked this by setting the options to display the whole dataframe.
Any ideas why this might be?
I can't see anything wrong with the csv. This is the header row of the csv (sorry it's so long):
Edit:
This is the code I have used:
processdf = pd.read_csv(thefile, compression='gzip')
timestamp,temp1,temp2,temp3,temp4,valveOpen,sontexTimestamp,sontexModbusAddress,sontexModbusParity,sontexModbusFlowControl,sontexModbusStopBits,sontexModbusCustomID,sontexModbusFabricationNo,sontexModbusFirmwareVer,sontexModbusBaudRate,sontexModbusRunningHours,sontexErrors,sontexEnergyUnit,sontexEnergyDecimals,sontexEnergyTotalizerHeating,sontexEnergyTotalizerTariff1,sontexEnergyTotalizerTariff2,sontexEnergyStoredST1,sontexEnergyTariff1ST1,sontexEnergyTariff2ST1,sontexEnergyStoredST2,sontexEnergyTariff1ST2,sontexEnergyTariff2ST2,sontexVolumeUnit,sontexVolumeDecimals,sontexVolume,sontexVolumeTariff1,sontexVolumeTariff2,sontexVolumeStoredST1,sontexVolumeTariff1ST1,sontexVolumeTariff2ST1,sontexVolumeStoredST2,sontexVolumeTariff1ST2,sontexVolumeTariff2ST2,sontexFlowUnit,sontexFlowDecimals,sontexFlow,sontexTempHighUnit,sontexTempHighDecimals,sontexTempHigh,sontexTempLowUnit,sontexTempLowDecimals,sontexTempLow,sontexTempDiffUnit,sontexTempDiffDecimals,sontexTempDiff,sontexOK,alphaSerial,alphaTotalToGrid,alphaTotalFromGrid,alphaTotalApparentPower,alphaTotalPowerFactor,alphaBatterySOC,alphaBatteryCharging,alphaBatteryDischarging,alphaBatteryChargeRelay,alphaBatteryDischargeRelay,alphaBmuSoftwareVersion,alphaLmuSoftwareVersion,alphaIsoSoftwareVersion,alphaBatteryNum,alphaBatteryCapacity,alphaBatteryType,alphaBatterySoh,alphaBatteryWarning,alphaBatteryFault,alphaBatteryChargeEnergy,alphaBatteryDischargeEnergy,alphaBatteryEnergyChargeFromGrid,alphaBatteryPower,alphaBatteryRemainingMins,alphaBatteryImpChargeSoc,alphaBatteryImpDischargeSoc,alphaBatteryRemainChargeSoc,alphaBatteryRemainDischargeSoc,alphaBatteryMaxChargePower,alphaBatteryMaxDischargePower,alphaBatteryMosClosed,alphaBatterySocCalibEnabled,alphaBatterySingleCutError,alphaBatteryFault1,alphaBatteryFault2,alphaBatteryFault3,alphaBatteryFault4,alphaBatteryFault5,alphaBatteryFault6,alphaBatteryWarning1,alphaBatteryWarning2,alphaBatteryWarning3,alphaBatteryWarning4,alphaBatteryWarning5,alphaBatteryWarning6,alphaBatteryOK,obVoltage,obFrequency,obCurrent,obActivePower,obApparentPower,obReactivePower,obPowerFactor,obImportActiveEnergy,obImportReactiveEnergy,obExportActiveEnergy,obExportReactiveEnergy,obTotalActiveEnergy,obOK
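One way to narrow this down (a diagnostic sketch only; the path stands in for the question's thefile variable) is to look at the raw header line inside the gzip and at what pandas actually parsed:
import gzip
import pandas as pd

thefile = "export.csv.gz"   # hypothetical path; stands in for the question's variable
with gzip.open(thefile, 'rt') as f:   # peek at the raw header line
    print(f.readline())

processdf = pd.read_csv(thefile, compression='gzip')
print(processdf.shape)                 # how many rows/columns pandas parsed
print(processdf.columns.tolist()[:10]) # the first few column names it found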

Writing DataFrame to csv file. Each column in a separate cell does not work. Delimiter does not do anything

I have a DataFrame with 2 columns and 30 rows. I want to write it to a csv file, but it always ends up writing my columns into one cell. Basically my DataFrame is 30x2 and my csv file is 30x1. I have tried every answer I found on Stack Overflow but nothing worked.
I have tried changing the separator from , to ; but nothing changed.
Currently it looks like this:
df_ts = pd.DataFrame(list(zip(file_name, timestamps_err)), columns=['FIle', 'Timestamp'])
df_ts.to_csv('Results_test_data.csv',mode="a", sep=',', index=False)
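To check whether the separator actually made it into the file (independent of how a spreadsheet program chooses to display it), one quick sketch is to print the raw lines that to_csv wrote, using the file name from the question:
with open('Results_test_data.csv') as f:
    for _ in range(3):
        print(repr(f.readline()))   # the separator should appear between the two fields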

Using PySpark to efficiently combine many small csv files (130,000 with 2 columns in each) into one large frame

This is another follow-up to an earlier question I posted: How can I merge these many csv files (around 130,000) using PySpark into one large dataset efficiently?
I have the following dataset https://fred.stlouisfed.org/categories/32263/downloaddata/INTRNTL_csv_2.zip
In it, there's a list of files (around 130,000) in the main directory, with their sub-directories listed, so the first entry might be A/AAAAA, and the file would be located at /data/A/AAAAA.csv
The files all have a similar format: the first column is called DATE and the second column is a series, always named VALUE. So first of all, the VALUE column needs to be renamed to the file name in each csv file. Second, the frames need to be full outer joined with each other with DATE as the main index. Third, I want to save the file and be able to load and manipulate it. The file should be roughly N rows (number of dates) x 130,001 columns.
I am trying to full outer join all the files into a single dataframe. I previously tried pandas but ran out of memory when trying to concat the list of files, and someone recommended that I try PySpark instead.
In a previous post I was told that I could do this:
df = spark.read.csv("/kaggle/input/bf-csv-2/BF_csv_2/data/**/*.csv", "date DATE, value DOUBLE")
But all the columns are named VALUE and the frame just becomes two columns: the first column is DATE and the second column is VALUE. It loads quite fast, around 38 seconds, and gives around 3.8 million values by 2 columns, so I know that it's not doing the full outer join; it's appending the files row-wise.
So I tried the following code:
import pandas as pd
import time
import os
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName('spark-dataframe-demo').getOrCreate()

filelist = pd.read_excel("/kaggle/input/list/BF_csv_2.xlsx")  # list of filenames
firstname = min(filelist.File)
length = len(filelist.File)

# read the first file and rename its VALUE column to the file name
dff = spark.read.csv("/kaggle/input/bf-csv-2/BF_csv_2/data/" + firstname,
                     inferSchema=True, header=True).withColumnRenamed("VALUE", firstname)

for row in filelist.File.items():   # row is an (index, filename) pair
    if row[1] == firstname:
        continue
    print(row[1], length, end='', flush=True)
    df = spark.read.csv("/kaggle/input/bf-csv-2/BF_csv_2/data/" + row[1],
                        inferSchema=True, header=True).withColumnRenamed("VALUE", row[1][:-4])
    # df = df.select(col("DATE").alias("DATE"), col("VALUE").alias(row[1][:-4]))
    dff = dff.join(df, ['DATE'], how='full')
    length -= 1

dff.write.save('/kaggle/working/whatever', format='parquet', mode='overwrite')
So to test it, I call df.show() after 3 columns are merged and it's quite fast. But when I try around 25 columns, it takes around 2 minutes. When I try 500 columns it's next to impossible.
I don't think I'm doing it right. The formatting and everything is correct. But why is it taking so long? How can I use PySpark properly? Are there any better libraries to achieve what I need?
Spark doesn't do anything magical compared to other software. The strength of Spark is parallel processing; most of the time that means you can use multiple machines to do the work. If you are running Spark locally you may hit the same issues you did when using pandas.
That being said, there might be a way for you to run it locally using Spark because it can spill to disk under certain conditions and does not need to have everything in memory.
I'm not well versed in PySpark, but the approach I'd take is:
load all the files like you did with /kaggle/input/bf-csv-2/BF_csv_2/data/**/*.csv
Use input_file_name from pyspark.sql.functions, which gives you the source path for each record in your DF (df.select("date", "value", input_file_name().alias("filename")) or similar)
Parse the path into the format I'd like to have as a column (e.g. extract the file name)
the schema should look like date, value, filename at this step
use the PySpark equivalent of df.groupBy("date").pivot("filename").agg(first("value")). Note: I used first() because I think you have 1 or 0 records for each combination (see the sketch after this list)
Also try setting the number of partitions to be equal to the number of dates you have
If you want the output as a single file, don't forget to repartition(1) before df.write. This step might be problematic depending on data size. You don't need to do this if you plan to keep using Spark for your work, as you could load the data using the same approach as in step 1 (/new_result_data/*.csv)
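A rough PySpark sketch of the steps above (paths reuse the ones from the question; the regexp pattern, the output path, and the raised spark.sql.pivotMaxValues limit are my assumptions, since pivoting on ~130,000 distinct file names exceeds the default cap):
from pyspark.sql import SparkSession
from pyspark.sql.functions import input_file_name, regexp_extract, first

spark = SparkSession.builder.appName('combine-csvs').getOrCreate()
spark.conf.set("spark.sql.pivotMaxValues", 200000)   # default caps distinct pivot values at 10000

# 1. Load every file in one pass; input_file_name() records where each row came from.
df = (spark.read
      .csv("/kaggle/input/bf-csv-2/BF_csv_2/data/**/*.csv",
           schema="DATE DATE, VALUE DOUBLE", header=True)
      .withColumn("filename", regexp_extract(input_file_name(), r"([^/]+)\.csv$", 1)))

# 2. One row per DATE, one column per file; first() because each
#    (DATE, filename) pair should have at most one VALUE.
wide = df.groupBy("DATE").pivot("filename").agg(first("VALUE"))

wide.write.save('/kaggle/working/combined', format='parquet', mode='overwrite')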

Keeping column alignment when reading merged cells in pandas

I am trying to read a bunch of tables into a dataframe using pandas. The files have an extension of .xls, but appear to be in HTML format, so I'm using the pandas.read_html() function. The issue I face is that the first column contains merged cells, and pandas is shifting values.
The original file:
The contents of the pandas dataframe:
As you can see, some of the values from the second column have been read into the first column. How can I make sure that the values are read into the correct column when one of the columns has merged cells?
Below is the code I'm using to read the files:
rawFileDir = 'C:/ftproot/Projects/Korea/Data/AL_Seg/Domestic'
rawFiles = os.listdir(rawFileDir)
for rawFile in rawFiles:
    path = os.path.join(rawFileDir, rawFile)  # listdir returns bare names, so build the full path
    if not os.path.isfile(path):
        continue
    xl = pandas.read_html(path)

Unable to get correct output from tsv file using pandas

I have a tsv file which I am trying to read with the help of pandas. The first two rows of the file are of no use and need to be ignored. However, when I get the output, I get it in the form of two columns. The name of the first column is Index and the name of the second column is a random row from the csv file.
import pandas as pd
data = pd.read_csv('zahlen.csv', sep='\t', skiprows=2)
Please refer to the screenshot below.
The second column name is in bold black, which is one of the rows from the file. Moreover, using '\t' as the delimiter does not separate the values into different columns. I am using the Spyder IDE for this. Am I doing something wrong here?
Try this:
data = pd.read_table('zahlen.csv', header=None, skiprows=2)
read_table() defaults to a tab separator, so it is better suited to tsv files (read_csv() is the same reader with a comma default). Then header=None makes the first remaining row data instead of using it as the header.
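Equivalently, with read_csv and the separator made explicit (a sketch using the file name from the question):
import pandas as pd

data = pd.read_csv('zahlen.csv', sep='\t', skiprows=2, header=None)  # skip the two junk rows, keep the rest as data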
