Pandas read_csv: Columns are being imported as rows - python

Edit: I believe this was all user error. I have been typing df.T by default, and it just occurred to me that this is very likely the TRANSPOSE output. By typing df, the dataframe is output normally (headers as columns). Thank you to those who stepped up to try and help. In the end, it was just my misunderstanding of pandas syntax.
Original Post
I'm not sure if I am making a simple mistake, but the columns in a .csv file are being imported as rows using pd.read_csv. The dataframe turns out to be 5 rows by 2000 columns. I am importing only 5 of the 14 columns, so I set up a list to hold the names of the columns I want; they match exactly those in the .csv file. What am I doing wrong here?
import os
import numpy as np
import pandas as pd
fp = 'C:/Users/my/file/path'
os.chdir(fp)
cols_to_use = ['VCOMPNO_CURRENT', 'MEASUREMENT_DATETIME',
               'EQUIPMENT_NUMBER', 'AXLE', 'POSITION']
df = pd.read_csv('measurement_file.csv',
                 usecols=cols_to_use,
                 dtype={'EQUIPMENT_NUMBER': np.int,
                        'AXLE': np.int},
                 parse_dates=[2],
                 infer_datetime_format=True)
Output:
0 ... 2603
VCOMPNO_CURRENT T92656 ... T5M247
MEASUREMENT_DATETIME 7/26/2018 13:04 ... 9/21/2019 3:21
EQUIPMENT_NUMBER 208 ... 537
AXLE 1 ... 6
POSITION L ... R
[5 rows x 2000 columns]
Thank you.
Edit: Note that if I import the entire .csv with the standard pd.read_csv('measurement_file.csv'), the columns are imported properly.
Edit 2: Sample csv:
VCOMPNO_CURRENT,MEASUREMENT_DATETIME,REPAIR_ORDER_NUMBER,EQUIPMENT_NUMBER,AXLE,POSITION,FLANGE_THICKNESS,FLANGE_HEIGHT,FLANGE_SLOPE,DIAMETER,RO_NUMBER_SRC,CL,VCOMPNO_AT_MEAS,VCOMPNO_SRC
T92656,10/19/2018 7:11,5653054,208,1,L,26.59,27.34,6.52,691.3,OPTIMESS_DATA,2MTA ,T71614 ,RO_EQUIP
T92656,10/19/2018 7:11,5653054,208,1,R,26.78,27.25,6.64,691.5,OPTIMESS_DATA,2MTA ,T71614 ,RO_EQUIP
T92656,10/19/2018 7:11,5653054,208,2,L,26.6,27.13,6.49,691.5,OPTIMESS_DATA,2MTA ,T71614 ,RO_EQUIP
T92656,10/19/2018 7:11,5653054,208,2,R,26.61,27.45,6.75,691.6,OPTIMESS_DATA,2MTA ,T71614 ,RO_EQUIP
T7L672,10/19/2018 7:11,5653054,208,3,L,26.58,27.14,6.58,644.4,OPTIMESS_DATA,2CTC ,T7L672 ,BOTH
T7L672,10/19/2018 7:11,5653054,208,3,R,26.21,27.44,6.17,644.5,OPTIMESS_DATA,2CTC ,T7L672 ,BOTH

A simple workaround here is just to take the transpose of the dataframe.
Link to Pandas Documentation
df = pd.DataFrame.transpose(df)
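In context, that might look something like this (a minimal sketch reusing the file name and column list from the question):
import pandas as pd

cols_to_use = ['VCOMPNO_CURRENT', 'MEASUREMENT_DATETIME',
               'EQUIPMENT_NUMBER', 'AXLE', 'POSITION']
df = pd.read_csv('measurement_file.csv', usecols=cols_to_use)

# If the frame prints with the headers down the side (5 rows x many columns),
# transpose it so the headers become column labels again.
if df.shape[0] < df.shape[1]:
    df = df.transpose()  # same as df.T

print(df.head())  # headers now appear as columns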

Can you try it like this?
import pandas as pd
dataset = pd.read_csv('yourfile.csv')
# filter to just the columns you need
dataset = dataset[cols_to_use]

Related

Use python and pandas to set key for imported data from text file to dataframe

This feels like an incredibly straightforward problem, but I am new and stuck, apologies.
It doesn't necessarily need a key, but that was how I thought to solve it.
I have a text file whose abbreviated contents resemble this:
name_of_source
128 1024.000000 225.569918
name_of_source_2
140 1120.000000 229.085200
etc etc
I really need the output dataframe to resemble:
name_of_source 128 1024.000000 225.569918
name_of_source_2 140 1120.000000 229.085200
I'm struggling to overcome the line break between the name and the data. Here is my attempt so far:
import pandas as pd
import os
data = pd.read_csv(path + 'combined.txt', header=None,
                   sep="\s+|\t+|\s+\t+|\t+\s+",
                   names='name vol1 vol2 vol3'.split(' '))
You can use pandas.DataFrame.join:
df = pd.read_csv("test.txt", header=None)
out = (
    df.rename(columns={0: "Name"})
      .join(df.shift(-1).rename(columns={0: "Vals"}))
      .iloc[::2]
)
# Output :
print(out)
Name Vals
0 name_of_source 128 1024.000000 225.569918
2 name_of_source_2 140 1120.000000 229.085200
If you need separate values, use pandas.Series.str.split with pandas.concat:
print(pd.concat([out, out.pop("Vals").str.split(expand=True).add_prefix('Vals_')], axis=1))
Name Vals_0 Vals_1 Vals_2
0 name_of_source 128 1024.000000 225.569918
2 name_of_source_2 140 1120.000000 229.085200

How to get values from first column of excel file as array?

Hi, I want to get the values from the first column of an Excel file as an array.
I have already written this code:
import os
import pandas as pd
for file in os.listdir("./python_files"):
    if file.endswith(".xlsx"):
        df = pd.read_excel(os.path.join("./python_files", file))
        print(df.iloc[:, 1])
What I got in the output now:
0 172081
1 163314
2 173547
3 284221
4 283170
...
3582 163304
3583 160560
3584 166961
3585 161098
3586 162499
Name: Organization CRD#, Length: 3587, dtype: int64
What I wish to get:
172081
163314
173547
284221
283170
...
163304
160560
166961
161098
162499
Can somebody help? Thanks :D
You just need to use the tolist method from pandas:
import os
import pandas as pd
for file in os.listdir("./python_files"):
    if file.endswith(".xlsx"):
        df = pd.read_excel(os.path.join("./python_files", file))
        print(df.iloc[:, 1].tolist())
Output
172081
163314
173547
284221
283170
...
163304
160560
166961
161098
162499
As an array:
df.iloc[:,1].values
As a list:
df.iloc[:,1].values.tolist()
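One small caveat: df.iloc[:, 1] selects the second column by position. If you literally want the first column of each file, position 0 is the one to use; a minimal sketch of the same loop:
import os
import pandas as pd

for file in os.listdir("./python_files"):
    if file.endswith(".xlsx"):
        df = pd.read_excel(os.path.join("./python_files", file))
        print(df.iloc[:, 0].tolist())  # position 0 = the first column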

Turning a pandas dataframe from Excel file (xlrd) to a list

I am new to Python. I read data from an Excel file. How may I turn the column into a list? The column is part of a pandas dataframe, read from an xlsx file by the xlrd package. Any better way to solve the problem would also be appreciated.
import pandas as pd
import xlrd
workbook = xlrd.open_workbook("MyData_XYZ.xlsx")
sheet1 = workbook.sheet_by_index(0)
def get_cell_range2(sheet, start_col, start_row, end_col, end_row):
    return [sheet.row_slice(row, start_colx=start_col - 1, end_colx=end_col)
            for row in range(start_row - 1, end_row)]

er_aaa = get_cell_range2(sheet1, 1, 2, 2, 67)
er_aaa_df = pd.DataFrame(er_aaa, columns=['date', 'aaa'])
raw_seq = list(er_aaa_df['aaa'])
I got this in Spyder
raw_seq
Out[61]:
0 number:25.405
1 number:25.427
2 number:25.411
3 number:25.423
4 number:25.45
...
61 number:26.054
62 number:26.09
63 number:26.103
64 number:26.1
65 number:26.03
Name: aaa, Length: 66, dtype: object
How can I turn the result to a simple list, namely,
[25.405, 25.427, 25.411, ...... 26.03]
Thank you!!
If I understand it correctly, you want to read data from the xlsx file and get one of its columns. You can get it like this:
df = pd.read_excel("MyData_XYZ.xlsx")
raw_seq = df['aaa'].tolist()  # avoid naming the variable "list", which shadows the built-in
Why don't you just use data = pd.read_excel("MyData_XYZ.xlsx", header=1)? See this link.
Then you can just select your values with data['number']. But be careful, this is still a pandas Series; if you want a pure list, just do list(data['number']).
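Roughly like this (a small sketch; it assumes the column really is named 'number' once that header row is read):
import pandas as pd

data = pd.read_excel("MyData_XYZ.xlsx", header=1)  # use the second row as column names
numbers = list(data['number'])                     # plain Python list instead of a Series
print(numbers[:5])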

Problems Sorting Data out of a text-file

I have a csv file imported into a dataframe and have trouble sorting the data.
df looks like this:
Data
0 <WindSpeed>0.69</WindSpeed>
1 <PowerOutput>0</PowerOutput>
2 <ThrustCoEfficient>0</ThrustCoEffici...
3 <RotorSpeed>8.17</RotorSpeed>
4 <ReactivePower>0</ReactivePower>
5 </DataPoint>
6 <DataPoint>
7 <WindSpeed>0.87</WindSpeed>
8 <PowerOutput>0</PowerOutput
I want it to look like this:
0 Windspeed Poweroutput
1 0.69 0.0
HereĀ“s the code that I wrote so far:
import pandas as pd
from pandas.compat import StringIO
import re
import numpy as np
df = pd.read_csv('powercurve.csv', encoding='utf-8', skiprows=42)
df.columns = ['Data']
no_of_rows = df.Data.str.count("WindSpeed").sum() / 2
rows = no_of_rows.astype(np.uint32)
TRBX = pd.DataFrame(index=range(0, abs(rows)),
                    columns=['WSpd[m/s]', 'Power[kW]'], dtype='float')
i = 0
for i in range(len(df)):
    if 'WindSpeed' in df['Data']:
        TRBX['WSpd[m/s]', i] = re.findall("'(\d+)'", 'Data')
    elif 'Rotorspeed' in df['Data']:
        TRBX['WSpd[m/s]', i] = re.findall("'(\d+)'", 'Data')
Is this a suitable approach? If so, why are no values being written into the TRBX dataframe? Where is my mistake?
The code below should help you if your df is indeed in the same format as the one you show:
import re
import pandas as pd

split_func = lambda x: re.split('<|>', str(x))
split_series = df.Data.apply(split_func)
data = split_series.apply(lambda x: x[2]).rename('data')          # the value between the tags
features = split_series.apply(lambda x: x[1]).rename('features')  # the tag name itself
df = pd.DataFrame(data).set_index(features).T
You may want to drop some columns that have no data or input some N/A values afterwards. You also may want to rename the variables and series to different names that make more sense to you.
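For example, dropping empty columns and filling the remaining gaps could look like this (a sketch on top of the reshaped df from above; the rename targets are just illustrations taken from the question):
df = df.dropna(axis=1, how='all')   # drop columns that carry no data at all
df = df.fillna('N/A')               # or fill remaining holes with a placeholder
df = df.rename(columns={'WindSpeed': 'WSpd[m/s]', 'PowerOutput': 'Power[kW]'})  # example renames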

How to read columns from different files and plot?

I have data of concentrations for every day from 2005 until 2018. I want to read three columns from three different files and combine them into one, so I can plot them.
Data:file 1
time, mean_OMNO2d_003_ColumnAmountNO2CloudScreened
2005-01-01,-1.267651e+30
2005-01-02,4.90778397e+15
...
2018-12-31,-1.267651e+30
Data:file 2
time, OMNO2d_003_ColumnAmountNO2TropCloudScreened
2005-01-01,-1.267651e+30
2005-01-02,3.07444147e+15
...
Data:file 3
time, OMSO2e_003_ColumnAmountSO2_PBL
2005-01-01,-1.267651e+30
2005-01-02,-0.0144000314
...
I want to plot time and mean_OMNO2d_003_ColumnAmountNO2CloudScreened, OMNO2d_003_ColumnAmountNO2TropCloudScreened, OMSO2e_003_ColumnAmountSO2_PBL into one graph.
import glob
import pandas as pd
file_list = glob.glob('*.csv')
no = []
no2 = []
so2 = []
for f in file_list:
    df = pd.read_csv(f, skiprows=8, parse_dates=['time'], index_col='time')
    df.columns = ['no', 'no2', 'so2']
    no.append([df["no"]])
    no2.append([df["no2"]])
    so2.append([df["so2"]])
How do I solve the problem?
This is very doable. I had a similar problem with 3 files all in one plot. My understanding is that you want to compare levels of NO, NO2, and SO2, that each column is in comparable order, and that you want to compare across rows. If you are ok with importing matplotlib and numpy, something like this may work for you:
import numpy as np
import matplotlib.pyplot as plt
NO = np.asarray(df["no"])
NO2 = np.asarray(df["no2"])
SO2 = np.asarray(df["so2"])
timestamp = np.asarray(df["your_time_stamp"])
plt.plot(timestamp, NO)
plt.plot(timestamp, NO2)
plt.plot(timestamp, SO2)
plt.savefig(name_of_plot)  # name_of_plot: path of the output image file
This will need some adjusting for your specific data frame, but I hope you see what I am getting at!
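If helpful, here is one way to line the three files up on their shared time index and plot them together (a sketch; the file pattern, skiprows, and column headers are assumed from the question):
import glob
import pandas as pd
import matplotlib.pyplot as plt

frames = []
for f in sorted(glob.glob('*.csv')):
    # parse the time column as the index so the three series align on date
    frames.append(pd.read_csv(f, skiprows=8, parse_dates=['time'], index_col='time'))

combined = pd.concat(frames, axis=1)  # one column per pollutant, aligned on time
combined.plot()                       # one line per column on a shared time axis
plt.savefig('concentrations.png')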
