Loading csv into pandas dataframe only reads 6 columns - python

I'm trying to read a csv file into a dataframe. The original csv has about 111 columns (not all needed, I want to drop some columns later), but on loading the data the resulting dataframe has only 6 columns, though it has the correct number of rows (~300).
I have double checked this by setting the options to display the whole dataframe.
Any ideas why this might be?
I can't see anything wrong with the csv. This is its header row (sorry it's so long):
timestamp,temp1,temp2,temp3,temp4,valveOpen,sontexTimestamp,sontexModbusAddress,sontexModbusParity,sontexModbusFlowControl,sontexModbusStopBits,sontexModbusCustomID,sontexModbusFabricationNo,sontexModbusFirmwareVer,sontexModbusBaudRate,sontexModbusRunningHours,sontexErrors,sontexEnergyUnit,sontexEnergyDecimals,sontexEnergyTotalizerHeating,sontexEnergyTotalizerTariff1,sontexEnergyTotalizerTariff2,sontexEnergyStoredST1,sontexEnergyTariff1ST1,sontexEnergyTariff2ST1,sontexEnergyStoredST2,sontexEnergyTariff1ST2,sontexEnergyTariff2ST2,sontexVolumeUnit,sontexVolumeDecimals,sontexVolume,sontexVolumeTariff1,sontexVolumeTariff2,sontexVolumeStoredST1,sontexVolumeTariff1ST1,sontexVolumeTariff2ST1,sontexVolumeStoredST2,sontexVolumeTariff1ST2,sontexVolumeTariff2ST2,sontexFlowUnit,sontexFlowDecimals,sontexFlow,sontexTempHighUnit,sontexTempHighDecimals,sontexTempHigh,sontexTempLowUnit,sontexTempLowDecimals,sontexTempLow,sontexTempDiffUnit,sontexTempDiffDecimals,sontexTempDiff,sontexOK,alphaSerial,alphaTotalToGrid,alphaTotalFromGrid,alphaTotalApparentPower,alphaTotalPowerFactor,alphaBatterySOC,alphaBatteryCharging,alphaBatteryDischarging,alphaBatteryChargeRelay,alphaBatteryDischargeRelay,alphaBmuSoftwareVersion,alphaLmuSoftwareVersion,alphaIsoSoftwareVersion,alphaBatteryNum,alphaBatteryCapacity,alphaBatteryType,alphaBatterySoh,alphaBatteryWarning,alphaBatteryFault,alphaBatteryChargeEnergy,alphaBatteryDischargeEnergy,alphaBatteryEnergyChargeFromGrid,alphaBatteryPower,alphaBatteryRemainingMins,alphaBatteryImpChargeSoc,alphaBatteryImpDischargeSoc,alphaBatteryRemainChargeSoc,alphaBatteryRemainDischargeSoc,alphaBatteryMaxChargePower,alphaBatteryMaxDischargePower,alphaBatteryMosClosed,alphaBatterySocCalibEnabled,alphaBatterySingleCutError,alphaBatteryFault1,alphaBatteryFault2,alphaBatteryFault3,alphaBatteryFault4,alphaBatteryFault5,alphaBatteryFault6,alphaBatteryWarning1,alphaBatteryWarning2,alphaBatteryWarning3,alphaBatteryWarning4,alphaBatteryWarning5,alphaBatteryWarning6,alphaBatteryOK,obVoltage,obFrequency,obCurrent,obActivePower,obApparentPower,obReactivePower,obPowerFactor,obImportActiveEnergy,obImportReactiveEnergy,obExportActiveEnergy,obExportReactiveEnergy,obTotalActiveEnergy,obOK
Edit:
This is the code I have used:
import pandas as pd
processdf = pd.read_csv(thefile, compression='gzip')  # thefile holds the path to the gzipped csv
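A hedged diagnostic, in case it helps: a column count this far off usually means pandas inferred the wrong delimiter or only part of each line was parsed. The sketch below peeks at the raw header line and then lets pandas sniff the separator; thefile and the gzip compression are taken from the question, everything else is an assumption.

import gzip
import pandas as pd

# Peek at the raw header line to see how many commas are really there
with gzip.open(thefile, 'rt') as f:   # thefile: path to the gzipped csv, as in the question
    first_line = f.readline()
print(first_line.count(',') + 1, 'fields in the raw header line')

# Let pandas sniff the separator instead of assuming a comma (sep=None needs the python engine)
probe = pd.read_csv(thefile, compression='gzip', sep=None, engine='python', nrows=5)
print(len(probe.columns), 'columns detected by pandas')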

Related

After writing a PySpark DataFrame to csv, half of the rows are missing

I tried to create sample data in a Spark DataFrame and write the resulting sample to a csv file. The number of records in the DataFrame is 1000000, but once I wrote the DataFrame to a csv file, only 462690 rows had been written.
I've tried different options in the write method, but the problem remains. I even tried converting the Spark dataframe to a pandas dataframe, and the same problem happens in that case too.
df.write.options(header='True', delimiter=',', quoteAll=True) \
    .csv("PresSample2.csv")
or
df2 = df.select("*").toPandas()
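No answer is shown here, but a hedged check for this symptom is embedded newlines in quoted string fields: such rows are written as single records yet counted as several (or dropped) when the output is read back without multiLine. A minimal sketch, assuming an active SparkSession named spark and the same output path as above:

# Read the written csv back with multiLine enabled so quoted fields containing
# newlines are parsed as single records instead of extra or broken rows
check = (spark.read
              .options(header=True, multiLine=True, quote='"', escape='"')
              .csv("PresSample2.csv"))

print(df.count(), check.count())   # if these match, the rows were written; only the read-back differed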

Pandas read_csv() taking the last row as col header instead of first?

I am trying to read a csv file of over 70,000 rows using pandas:
df = pd.read_csv("test.csv", encoding='ISO 8859-1')
In this CSV file, the first row contains my column headers. However, when I run the above code and then df.info(), I see that the DF header has been taken from the last row of my CSV file. Also, instead of having 70,000 rows of data in the DF, there are only 146 rows.
What could be the reason that this is happening? It is odd because the same csv file and script work perfectly fine for my co-worker. Many thanks for your help! :)
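No answer is shown here either, but a hedged first diagnostic (the file name is taken from the question) is to look at the raw line endings: bare carriage returns or unquoted line breaks can explain both the header coming from the wrong row and the collapsed row count.

import pandas as pd

# Inspect the raw bytes before pandas interprets anything
with open("test.csv", "rb") as f:
    raw = f.read()
print(raw.count(b"\n"), "line feeds,", raw.count(b"\r"), "carriage returns")
print(raw.splitlines()[0][:200])   # should print the expected header row

# Re-read with the python engine, which is often more forgiving about odd line endings
df = pd.read_csv("test.csv", encoding="ISO-8859-1", engine="python")
print(df.shape)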

Python excel to csv copying column data with different header names

So here is my situation. Using Python, I want to copy specific columns from an Excel spreadsheet into specific columns of a csv file.
The pre-filled column header names are named differently in each spreadsheet, and I need to use a sublist as a parameter.
For example, in the first sublist, a data column needs to be copied from the spreadsheet to the csv:
"scan_date" (spreadsheet) => "date_of_scan" (csv)
Two sublists as parameters: one with the names to copy from excel, one with the names of where to paste into the csv.
I'm not sure whether a dictionary would be better than two individual sublists?
Also, the csv column header names are in the second row (not the first row, as in the excel file), which has complicated things such as building the data frames.
So, ideally I would like to have the sublists converted to arrays, and then:
the code iterates over the spreadsheet columns to find "scan_date",
copies the data,
iterates to find "date_of_scan" in the csv,
pastes the data,
and moves on to the second item in the sublists and repeats.
I've tried pandas and openpyxl and just can't seem to figure out the approach/syntax of how to do it.
Any help would be greatly appreciated.
Thank you.
Clarification edit:
The csv file has some preexisting data within. Also, I cannot change the headers into different columns. So, if "date_of_scan" is in column "RF" then it must stay in column "RF". I was able to copy, say, the 5 columns of data from excel into a temp spreadsheet and then concatenate into the csv but it always moved the pasted columns to the beginning of the csv document (columns A, B, C, D, E).
It is hard to know the answer without seeing your specific dataset, but it seems to me that a simpler approach might be to make your excel sheet a df, drop everything except the columns you want in the csv, then write a csv with pandas. Here's some pseudo-code.
import pandas as pd

df = pd.read_excel('your_file_name.xlsx')
drop_cols = ['old_col_1', 'old_col_2']   # list of columns to get rid of (placeholders)
df = df.drop(drop_cols, axis='columns')  # drop() returns a new frame, so reassign
col_dict = {'a': 'x', 'b': 'y', 'c': 'z'}  # however you want to map your columns; here a, b, c are old names and x, y, z are new ones
# this line will actually rename your columns with the dictionary
df = df.rename(columns=col_dict)
df.to_csv('new_file_name.csv')  # write new file
And this will actually run in Python, though here I created the df from dummy data instead of an excel file.
# with dummy data
df = pd.DataFrame([0, 1, 2], index=['a', 'b', 'c']).T
col_dict = {'a': 'x', 'b': 'y', 'c': 'z'}
df = df.rename(columns=col_dict)
df.to_csv('new_file_name.csv')  # write new file
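The sketch above writes a brand-new csv, so it does not cover the clarification about keeping the existing csv's column positions. A hedged way to do that, again with pandas; the file names, the header row position, and the name map are illustrative assumptions:

import pandas as pd

# Existing csv whose column layout must be preserved; its headers sit in the
# second row, hence header=1 (adjust if that assumption is wrong)
target = pd.read_csv('existing.csv', header=1)

# Source spreadsheet plus a map from excel column names to csv column names
source = pd.read_excel('source.xlsx')
name_map = {'scan_date': 'date_of_scan'}   # extend with the other column pairs

# Overwrite only the mapped columns; every other column keeps its data and position
# (assumes both frames have the same number of rows, in the same order)
for excel_col, csv_col in name_map.items():
    target[csv_col] = source[excel_col].values

target.to_csv('updated.csv', index=False)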

Keeping column alignment when reading merged cells in pandas

I am trying to read a bunch of tables into a dataframe using pandas. The files have an extension of .xls, but appear to be in HTML format, so I'm using the pandas.read_html() function. The issue I face is that the first column contains merged cells, and pandas is shifting values.
Screenshots of the original file and of the resulting pandas dataframe were included in the question (not reproduced here).
As you can see, some of the values from the second column have been read into the first column. How can I make sure that the values are read into the correct column when one of the columns has merged cells?
Below is the code I'm using to read the files:
import os
import pandas

rawFileDir = 'C:/ftproot/Projects/Korea/Data/AL_Seg/Domestic'
rawFiles = os.listdir(rawFileDir)

for rawFile in rawFiles:
    rawPath = os.path.join(rawFileDir, rawFile)  # listdir returns bare names, so join with the directory
    if not os.path.isfile(rawPath):
        continue
    xl = pandas.read_html(rawPath)
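There is no accepted fix shown here, but a hedged workaround is to repair the alignment after reading. The sketch below assumes that a row whose last cell comes back empty is one where the merged first cell was absent, so its values landed one column too far to the left; both the column layout and that rule are assumptions about this particular file.

import pandas as pd

df = pd.read_html(rawPath)[0]   # first table from one of the files in the loop above

# Rows whose last cell is empty are assumed to have been shifted left
shifted = df.iloc[:, -1].isna()

# Push those rows one column to the right, then rebuild the merged first
# column by carrying the last seen value downwards
df.loc[shifted, df.columns[1]:] = df.loc[shifted, df.columns[:-1]].values
df.iloc[:, 0] = df.iloc[:, 0].where(~shifted).ffill()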

Unable to get correct output from tsv file using pandas

I have a tsv file which I am trying to read with the help of pandas. The first two rows of the file are of no use and need to be ignored. However, the output I get has only two columns. The name of the first column is Index, and the name of the second column is a random row from the file.
import pandas as pd
data = pd.read_csv('zahlen.csv', sep='\t', skiprows=2)
Please refer to the screenshot below.
The second column name is in bold black, and it is one of the rows from the file. Moreover, using '\t' as the delimiter does not separate the values into different columns. I am using the Spyder IDE for this. Am I doing something wrong here?
Try this:
data = pd.read_table('zahlen.csv', header=None, skiprows=2)
read_table() defaults to a tab separator, so it is better suited to tsv files, while read_csv() defaults to a comma. Passing header=None treats the first remaining row as data instead of as the header.
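If a literal tab really is not splitting the columns, a hedged follow-up is to let pandas sniff the separator itself; sep=None requires the python engine, and skiprows=2 still discards the two unusable rows:

import pandas as pd

# Let the parser detect the delimiter instead of assuming '\t'
data = pd.read_csv('zahlen.csv', sep=None, engine='python', skiprows=2, header=None)
print(data.head())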
