Creating an object fro csv data - python

I want to create an object(maybe list?? still waiting for suggestions) that contains data read from a .csv file
The data looks like this:
[‘;;;;;;;;;;;;;;Number;;Semester;;Grade;;;’]
[';;;;;;;;;;;;;;1;;I;;A;;;']
[‘;;;;;;;;;;;;;;2;;I;;C;;;']
[';;;;;;;;;;;;;;3;;II;;A;;;']
I'm thinking this could be solved using regex.
The idea is that the first row determines the order in which 'number, semester, grade' appears and the next rows are what we need to store.

How about using pandas and the read_csv?
import pandas
pandas.read_csv('file.csv', sep=',',header=1)
https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html

Related

Cleaning a messy Dataset

I got an originally txt file converted to csv.
I have the column names but there is practically one row in the
unprocessed dataset.
How do I clean the dataset using pandas,numpy exc.methods so that each string/int between every comma will be placed in seperated column with the proper column name?
Thanks,
Ido
cols = ['AIRLINE_ID','AIRLINE_NAME','ALIAS','IATA','ICAO','CALLSIGN','COUNTRY','ACTIVE'
]
Airlines_raw_dataset
I looked for videos regarding this topic on youtube but I didn't encounter specific info for this highly dirty dataset.
Pandas Has a built in method for reading csv files. It may be used in this fashion
df = pd.read_csv('filename.csv')
You can read more about this method here -> Official Docs

What is the fastest way to retrieve header names from excel files using pandas

I have a big size excel files that I'm organizing the column names into a unique list.
The code below works, but it takes ~9 minutes!
Does anyone have suggestions for speeding it up?
import pandas as pd
import os
get_col = list(pd.read_excel("E:\DATA\dbo.xlsx",nrows=1, engine='openpyxl').columns)
print(get_col)
Using pandas to extract just the column names of a large excel file is very inefficient.
You can use openpyxl for this:
from openpyxl import load_workbook
wb = load_workbook("E:\DATA\dbo.xlsx", read_only=True)
columns = {}
for sheet in worksheets:
for value in sheet.iter_rows(min_row=1, max_row=1, values_only=True):
columns = value
Assuming you only have one sheet, you will get a tuple of column names here.
If you want faster reading, then I suggest you use other type files. Excel, while convenient and fast are binary files, therefore for pandas to be able to read it and correctly parse it must use the full file. Using nrows or skipfooter to work with less data with only happen after the full data is loaded and therefore shouldn't really affect the waiting time. On the opposite, when working with a .csv() file, given its type and that there is no significant metadata, you can just extract the first rows of it as an interable using the chunksize parameter in pd.read_csv().
Other than that, using list() with a dataframe as value, returns a list of the columns already. So my only suggestion for the code you use is:
get_col = list(pd.read_excel("E:\DATA\dbo.xlsx",nrows=1, engine='openpyxl'))
The stronger suggestion is to change datatype if you specifically want to address this issue.

Reading in a multiindex .csv file as returned from pandas using the ftable type in R

I have a multi-index (multi-column to be exact) pandas data frame in Python that I saved using the .to_csv() method. Now I would like to continue my analysis in R. For that I need to read in the .csv file. I know that R does not really support multi-index data frames like pandas does but it can handle ftables using the stats package. I tried to use read.ftable() but I can't figure out how to set the arguments right to correctly import the .csv file.
Here's some code to create a .csv file that has the same structure as my original data:
require(stats)
# create example .csv file with a multiindex as it would be saved when using pandas
fileConn<-file('test.csv')
long_string = paste("col_level_1,a,b,c,d\ncol_level_2,cat,dog,tiger,lion\ncol_level_3,foo,foo,foo,foo\nrow_level_1,,,,\n1,",
"\"0,525640810622065\",\"0,293400380474675\",\"0,591895790442417\",\"0,675403394728461\"\n2,\"0,253176104907883\",",
"\"0,107715459748816\",\"0,211636325794272\",\"0,618270276545688\"\n3,\"0,781049927692169\",\"0,72968971635063\",",
"\"0,913378426593516\",\"0,739497259262532\"\n4,\"0,498966730971063\",\"0,395825713762063\",\"0,252543611974303\",",
"\"0,240732390893718\"\n5,\"0,204075522469035\",\"0,227454178487449\",\"0,476571725142606\",\"0,804041968683541\"\n6,",
"\"0,281453400066927\",\"0,010059089264751\",\"0,873336799707968\",\"0,730105129502755\"\n7,\"0,834572206714808\",",
"\"0,668889079581709\",\"0,516135581764696\",\"0,999861473609101\"\n8,\"0,301692961056344\",\"0,702428450077691\",",
"\"0,211660363912457\",\"0,626178589354395\"\n9,\"0,22051883447221\",\"0,934567760412661\",\"0,757627523007149\",",
"\"0,721590060307143\"",sep="")
writeLines(long_string, fileConn)
close(fileConn)
When opening the .csv file in a reader of your choice, it should look like this:
How can I read this in using R?
I found one solution without using read.ftable() based on this post. Not that this won't give you the data in the ftable format:
headers <- read.csv(file='./test.csv',header=F,nrows=3,as.is=T,row.names=1)
dat <- read.table('./test.csv',skip=4,header=F,sep=',',row.names=1)
headers_collapsed <- apply(headers,2,paste,collapse='.')
colnames(dat) <- headers_collapsed

What code should I use in extracting specific column (with specific data) from a csv file to python. It can be either pandas or numpy

please see attached photo
here's the image
I only need to import a specific column with conditions(such as specific data found in that column). And also, I only need to remove unnecessary columns. dropping them takes too much code. What specific code or syntax is applicable?
How to get a column from pandas dataframe is answered in Read specific columns from a csv file with csv module?
To quote:
Pandas is spectacular for dealing with csv files, and the following
code would be all you need to read a csv and save an entire column
into a variable:
import pandas as pd
df = pd.read_csv(csv_file)
saved_column = df.column_name #you can also use df['column_name']
So in your case, you just save the the filtered data frame in a new variable.
This means you do newdf = data.loc[...... and then use the code snippet from above to extract the column you desire, for example newdf.continent

Options for creating csv structure I can work with

My task is to take an output from a machine, and convert that data to json. I am using python, but the issue is the structure of the output.
From my research online, csv usually has the first row with the keys and the values in the same order underneath. Example: https://e.nodegoat.net/CMS/upload/guide-import_person_csv_notepad.png
However, the output from my machine doesn't look like this.
Mine looks like:
Date:,10/10/2015
Name:,"Company name"
Location:,"Company location"
Serial num:,"Serial number"
So the machine i'm working with is outputting each result on a new .dat file instead of appending onto a single csv with the row of keys and whatnot. Technically, yes the data is separated with csv, but not sure how to work with this.
How should I go about turning this kind of data to json? Should I look into restructuring the data to the default csv? Or is there a way I can work with this and not do any cleanup to convert this? In either case, any direction is appreciated.
You can try transpose using pandas
import pandas as pd
from io import StringIO
data = '''\
Date:,10/10/2015
Name:,"Company name"
Location:,"Company location"
Serial num:,"Serial number"
'''
f = StringIO(data)
df = pd.read_csv(f)
t = df.set_index(df.columns[0]).T
print(t['Location:'][0])
print(t['Serial num:'][0])

Categories

Resources