I am new to Python and was recently given a problem to solve.
Briefly, I have a .csv file containing some data. I am supposed to read the file and print out the first 5 column header names, followed by 5 rows of the data, in the format shown in the picture below.
[Image: Results]
Currently, I have written the code up to:
readfiles = file.readlines()
for i in readfiles:
    data = i.strip()
    print(data)
and have managed to print out all the data. However, I am not sure how to get just the 5 rows of data that the problem requires. I am wondering if the .csv file should be converted into a list first? Hoping someone can help me with this. Thank you.
I can't use pandas or csv for this by the way. =/
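In case it helps, here is a minimal sketch of one way to finish this with plain file reading and string splitting, given those constraints. The file name is a placeholder, and it assumes the first line is the header and that no quoted fields contain commas:

with open('data.csv') as f:  # placeholder file name
    lines = [line.strip() for line in f]

header = lines[0].split(',')
print(header[:5])  # first 5 column header names

for line in lines[1:6]:  # first 5 data rows
    print(line.split(','))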
import pandas as pd

df = pd.read_csv('path/to/your.csv')  # placeholder path
df.head()

If you want it as a list:

needed_list = df.head().values.tolist()
First of all, if you want to read a csv file, you can use the pandas library to do it.
import pandas as pd
df = pd.read_csv("path/to/your/file")
print(df.columns[0:5]) # print first 5 column names
print(df.head(5)) # Print first 5 rows
Or if you want to do it without pandas:

with open("path/to/file.csv", "r") as fl:
    rows = [x.split(",") for x in fl.read().split("\n")]

print(rows[0][0:5])  # print first 5 column names (the header row)
print(rows[1:6])     # print first 5 data rows
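One caveat worth noting (my addition): splitting on commas by hand breaks on quoted fields that contain commas, and a trailing newline produces an empty last row. For readers who are allowed the csv module, a sketch of the more robust equivalent:

import csv

with open("path/to/file.csv", newline="") as fl:
    rows = list(csv.reader(fl))  # handles quoted fields correctly

print(rows[0][:5])  # first 5 column names
print(rows[1:6])    # first 5 data rows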
I have a .txt file with 170k rows. I am importing the txt file into pandas.
Each row has a number of values separated by a comma.
I want to extract the rows with 9 values.
I am currently using:
data = pd.read_csv('uart.txt', sep=",")
The first thing you should try is to preprocess the file:
import csv

# newline='' is the csv module's recommended way to open files
with open('uart.txt', 'r', newline='') as inp, open('uart_processed.txt', 'w', newline='') as outp:
    inp_csv = csv.reader(inp)
    outp_csv = csv.writer(outp)
    for row in inp_csv:
        if len(row) == 9:
            outp_csv.writerow(row)
There may be more efficient ways to do that, but it is the simplest thing you can do, and it entirely removes invalid rows.
As @ksooklall answered, if you only need 2 columns, then for simplicity:
a row [a,b,c,d] will end up in your DataFrame as [a, b]
and a row [e] as [e, NaN]
So, if you're OK with that, go ahead; no preprocessing is required.
If you know the names of the 9 columns, you can do:
df = pd.read_csv('uart.txt', names=list('abcdefghj'))
This will only read the first 9 columns.
As long as your header rows are fine, you can use:
data = pd.read_csv('uart.txt', sep=",", error_bad_lines=False, warn_bad_lines=True)
This will ignore any lines having more than the desired number of values and will also show which lines were skipped.
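A side note (not from the original answer): in pandas 1.3+, error_bad_lines and warn_bad_lines are deprecated in favor of on_bad_lines, so the equivalent call there would be:

data = pd.read_csv('uart.txt', sep=',', on_bad_lines='skip')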
If you know that the rest of the actual data (i.e. the lines that do have 9 values) has no missing values, then you can chain dropna after reading to drop all rows with fewer than 9 records:
data = pd.read_csv('uart.txt', sep=",", error_bad_lines=False, warn_bad_lines=True).dropna()
However, if the records that have 9 values can have NAs (e.g. 242,2421,,,,,,,1) then I don't think there's a built-in way in Pandas and you'd have to pre-process the csv before reading it in.
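For that pre-processing, a minimal sketch that keeps only lines with exactly 9 fields while counting empty ones, so a row like 242,2421,,,,,,,1 survives (this assumes no quoted fields contain commas):

# 9 comma-separated fields means exactly 8 commas,
# and empty-but-present fields still count
with open('uart.txt') as inp, open('uart_9cols.txt', 'w') as outp:
    for line in inp:
        if line.rstrip('\n').count(',') == 8:
            outp.write(line)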
I have a CSV file, and I want to delete some of its rows based on the values of one of the columns. I do not know the code to delete specific rows of a CSV file once it has been loaded as a pandas.core.frame.DataFrame.
I read related questions, and I found that people suggest writing every acceptable line to a new file. I do not want to do that. What I want is:
1) to delete the rows that I know the index of them (number of the row)
or
2) to make a new CSV in the memory of the python (not to write and again read it )
Here's an example of what you can do with pandas. If you need more detail, you might find Indexing and Selecting Data a helpful resource.
import pandas as pd
from io import StringIO
mystr = StringIO("""speed,time,date
12,22:05,1
15,22:10,1
13,22:15,1""")
# replace mystr with 'file.csv'
df = pd.read_csv(mystr)
# convert time column to timedelta, assuming mm:ss
df['time'] = pd.to_timedelta('00:'+df['time'])
# filter for >= 22:10, i.e. second item
df = df[df['time'] >= df['time'].loc[1]]
print(df)
speed time date
1 15 00:22:10 1
2 13 00:22:15 1
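For point (1) in the question, dropping rows whose positional index is already known, here is a short sketch (assuming the default RangeIndex that read_csv creates; the file name is a placeholder):

import pandas as pd

df = pd.read_csv('file.csv')     # placeholder file name
rows_to_delete = [0, 2]          # positions of the rows to remove
df = df.drop(df.index[rows_to_delete])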
I have a very large csv file with millions of rows, and a list of the row numbers that I need, like:
rownumberList = [1,2,5,6,8,9,20,22]
I know there is a parameter called skiprows that helps to skip several rows when reading a csv file, like this:
df = pd.read_csv('myfile.csv',skiprows = skiplist)
#skiplist would contain the total row list deducts rownumberList
However, since the csv file is very large, directly selecting the rows that I need could be more efficient. So I was wondering, are there any methods to select rows while calling read_csv, rather than selecting rows from the DataFrame afterwards? I am trying to minimize the time spent reading the file. Thanks.
There is a parameter called nrows (int, default None): "Number of rows of file to read. Useful for reading pieces of large files." (Docs)

pd.read_csv(file_name, nrows=int)
In case you need some part in the middle, use both skiprows and nrows in read_csv: skiprows indicates how many rows to skip at the beginning, and nrows indicates how many rows to read after that.
Example:
pd.read_csv('../input/sample_submission.csv', skiprows=5, nrows=10)
This will skip the first 5 rows of the file and then read the next 10.
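One caveat (my note, not the original answer's): skiprows=5 also skips the file's header line, so the 6th line becomes the header. If the header must be kept, skip a range that starts after it:

pd.read_csv('../input/sample_submission.csv', skiprows=range(1, 6), nrows=10)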
Edit based on comment:
Since there is a list, this might help, i.e.:

li = [1,2,3,5,9]
r = [i for i in range(max(li)) if i not in li]
df = pd.read_csv('../input/sample_submission.csv', skiprows=r, nrows=max(li))
# This skips the rows you don't want and also limits the number of rows read to the maximum of the list.
import pandas as pd
rownumberList = [1,2,5,6,8,9,20,22]
df = pd.read_csv('myfile.csv',skiprows=lambda x: x not in rownumberList)
For pandas 0.25.1, read_csv lets you pass a callable to skiprows, as above. Note that row 0 is the header line, so it is skipped too unless 0 is in the list (or you pass header=None).
I am not sure about read_csv() from Pandas (there is though a way to use an iterator for reading a large file in chunks), but you can read the file line by line (lazy-loading, not reading the whole file in memory) with csv.reader (or csv.DictReader), leaving only the desired rows with the help of enumerate():
import csv
import pandas as pd

DESIRED_ROWS = {1, 17, 28}

with open("input.csv") as input_file:
    reader = csv.reader(input_file)
    desired_rows = [row for row_number, row in enumerate(reader)
                    if row_number in DESIRED_ROWS]

df = pd.DataFrame(desired_rows)
(This assumes you would like to pick random/discontinuous rows and not a "continuous chunk" from somewhere in the middle; in that case @James's idea of having "start" and "stop" would generally work better.)
import pandas as pd
df = pd.read_csv('Data.csv')
df.iloc[3:6]
Returns rows 3 through 5 and all columns.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.iloc.html
From the documentation you can see that skiprows can take an integer or a list as its value, indicating the lines to skip.
So basically you can tell it to remove all lines except the ones you want. For this you first need to know the number of lines in the file (best if you know it beforehand), by opening it and counting:

with open('myfile.csv') as f:
    row_count = sum(1 for row in f)

Now you need to create the complementary collection (sets are used here, which works because skiprows accepts any list-like collection). First you create the set of all line numbers, and then you subtract the numbers of the rows you want to read:

skiplist = set(range(1, row_count+1)) - set(rownumberList)

Finally you can read the csv as usual:

df = pd.read_csv('myfile.csv', skiprows=skiplist)
Here is the full code:

import pandas as pd

with open('myfile.csv') as f:
    row_count = sum(1 for row in f)

rownumberList = [1,2,5,6,8,9,20,22]
skiplist = set(range(1, row_count+1)) - set(rownumberList)
df = pd.read_csv('myfile.csv', skiprows=skiplist)
You could try this:

import pandas as pd

# making a data frame from a csv file
data = pd.read_csv("your_csv_file.csv", index_col="What_you_want")

# retrieving multiple rows by the iloc method
rows = data.iloc[[1, 2, 5, 6, 8, 9, 20, 22]]

Note that this reads the entire file first, so it does not save any reading time.
You will not be able to circumvent the read time when accessing a large file. If you have a very large CSV file, any program will need to read through it at least up to the point where you want to begin extracting rows. Really, that is what databases are designed for.
However, if you want to extract rows 300,000 to 300,123 from a 10,000,000 row CSV file, you are better off reading just the data you need into Python before converting it to a data frame in Pandas. For this you can use the csv module.
import csv
import pandas as pd

start = 300000
stop = start + 123

data = []
with open('/very/large.csv', 'r') as fp:
    reader = csv.reader(fp)
    for i, line in enumerate(reader):
        if i > stop:
            break
        if i >= start:
            data.append(line)

df = pd.DataFrame(data)
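A small follow-up (my addition): the slice above carries no column names, since the header sits on row 0. One hedged way to attach them, assuming every extracted row has the same number of fields as the header:

# read just the header line and reuse it for the column names
with open('/very/large.csv', 'r') as fp:
    header = next(csv.reader(fp))
df = pd.DataFrame(data, columns=header)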
for i in range(1, 20):

the first parameter is the first row, and the stop parameter is one past the last row (range excludes its stop value)...
So I've currently got a dataset that has a column called 'logid', which consists of 4-digit numbers. I have about 200k rows in my csv files, and I would like to count each unique logid and output it something like this:
logid | #ofoccurrences for each unique id. So it might be 1000 | 10, meaning that the logid 1000 is seen 10 times in the csv file column 'logid'. The separator | is not necessary, just making it easier for you guys to read. This is my code currently:
import pandas as pd
import os, sys
import glob
count = 0
path = "C:\\Users\\cam19\\Desktop\\New folder\\*.csv"
for fname in glob.glob(path):
    df = pd.read_csv(fname, dtype=None, names=['my_data'], low_memory=False)
    counts = df['my_data'].value_counts()

counts
Using this I get a weird output that I don't quite understand:
4 16463
10013 490
pserverno 1
Name: my_data, dtype: int64
I know I am doing something wrong in the last line
counts = df['my_data'].value_counts()
but I am not too sure what. For reference, the values I am extracting are from column C in the Excel file (so I guess that's column 3?). Thanks in advance!
OK, from my understanding, I think the csv file may look like this:
row1,row1,row1
row2,row2,row2
row3,row3,row3
logid,header1,header2
1000,a,b
1001,c,d
1000,e,f
1001,g,h
And with a csv file in this format, I have done it like this:

# skipping the first three rows
df = pd.read_csv("file_name.csv", skiprows=3)
print(df['logid'].value_counts())
And the output looks like this:
1001 2
1000 2
Hope this will help.
Update 1:

df = pd.read_csv(fname, dtype=None, names=['my_data'], low_memory=False)

In this line, the parameter names=['my_data'] creates a new header for the data frame. As your csv file has a header row, you can skip this parameter. And since the header you want is on the fourth line, you can skip the first three rows. One last thing: you are reading all the csv files in the given path, so make sure all of the csv files have the same format. Happy coding.
I think you need to create one big DataFrame by appending all the df to a list and then concatenating them first:

import glob
import pandas as pd

dfs = []
path = "C:\\Users\\cam19\\Desktop\\New folder\\*.csv"
for fname in glob.glob(path):
    df = pd.read_csv(fname, dtype=None, usecols=['logid'], low_memory=False)
    dfs.append(df)

df = pd.concat(dfs)
Then use value_counts; the output is a Series, so to get a 2-column DataFrame you need rename_axis with reset_index:

counts = df['logid'].value_counts().rename_axis('logid').reset_index(name='count')
counts

Or groupby and aggregate size:

counts = df.groupby('logid').size().reset_index(name='count')
counts
You may try this:

counts = df['logid'].value_counts()

Now counts should give you the count of each value.
Help is greatly appreciated!
I have a CSV that looks like this:
[Image: CSV example]
I am writing a program to check that each column holds the correct data type. For example:
Column 1 - Must have valid time stamp
Column 2 - Must hold the value 2
Column 4 - Must be consecutive (If not how many packets missing)
Column 5/6 - A calculation is done on both values, and the outcome must match an inputted value
The columns can be in different positions.
I have tried using the pandas module to give each column an 'id':
import pandas as pd

fields = ['star_name', 'ra']
df = pd.read_csv('data.csv', skipinitialspace=True, usecols=fields)

print(df.keys())
print(df.star_name)
However when doing the checks on the data it seems get confused. What would be the next best approach to do something like this?
I have really been killing myself over this and any help would be appreciated.
Thank you!
Try using the 'csv' module.
Example
import csv

with open('data.csv', 'r') as f:
    # The first line of the file is assumed to contain the column names
    reader = csv.DictReader(f)

    # Read one row at a time.
    # If you need to compare with the previous row, just store it in a variable.
    prev_title_4_value = 0
    for row in reader:
        print(row['Title 1'], row['Title 3'])

        # Sample to illustrate how column 4 values can be compared
        curr_title_4_value = int(row['Title 4'])
        if (curr_title_4_value - prev_title_4_value) != 1:
            print('Values are not consecutive')
        prev_title_4_value = curr_title_4_value
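For the other checks the question lists (timestamp validity and fixed values), here is a rough pandas sketch; all column names are placeholders, since the real headers are only visible in the linked image:

import pandas as pd

df = pd.read_csv('data.csv')

# Column 1: every value must parse as a timestamp
timestamps = pd.to_datetime(df['col1'], errors='coerce')
print('invalid timestamps:', timestamps.isna().sum())

# Column 2: must hold the value 2
print('rows where col2 != 2:', (df['col2'] != 2).sum())

# Column 4: must be consecutive; positive gaps reveal missing packets
gaps = df['col4'].diff().dropna() - 1
print('missing packets:', int(gaps[gaps > 0].sum()))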