Help is greatly appreciated!
I have a CSV that looks like this:
[image: CSV example]
I am writing a program to check that each column holds the correct data type. For example:
Column 1 - Must have valid time stamp
Column 2 - Must hold the value 2
Column 4 - Must be consecutive (if not, report how many packets are missing)
Column 5/6 - Calculation done on both values; the outcome must match the inputted value
The columns can be in different positions.
I have tried using the pandas module to give each column an 'id':
import pandas as pd
fields = ['star_name', 'ra']
df = pd.read_csv('data.csv', skipinitialspace=True, usecols=fields)
print(df.keys())
print(df.star_name)
However, when doing the checks on the data it seems to get confused. What would be the next best approach to do something like this?
I have really been killing myself over this and any help would be appreciated.
Thank you!
Try using the 'csv' module.
Example
import csv
with open('data.csv', 'r') as f:
    # The first line of the file is assumed to contain the column names
    reader = csv.DictReader(f)
    # Read one row at a time
    # If you need to compare with the previous row, just store that in a variable(s)
    prev_title_4_value = 0
    for row in reader:
        print(row['Title 1'], row['Title 3'])
        # Sample to illustrate how column 4 values can be compared
        curr_title_4_value = int(row['Title 4'])
        if (curr_title_4_value - prev_title_4_value) != 1:
            print('Values are not consecutive')
        prev_title_4_value = curr_title_4_value
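The question also asks how many packets are missing when the sequence breaks. The same loop can report the gap size; this is a minimal sketch, assuming 'Title 4' holds increasing integer sequence numbers:

import csv

with open('data.csv', 'r') as f:
    reader = csv.DictReader(f)
    prev = None
    for row in reader:
        curr = int(row['Title 4'])
        # a jump of n in the sequence means n - 1 packets were skipped
        if prev is not None and curr - prev > 1:
            print('Missing {} packet(s) between {} and {}'.format(
                curr - prev - 1, prev, curr))
        prev = curr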
Related
I am new to Python and was recently given a problem to solve.
Briefly, I have a .csv file consisting of some data. I am supposed to read the .csv file and print out the first 5 column header names, followed by 5 rows of the data, in the format shown in the picture.
[image: Results]
Currently, I have written the code up to:
readfiles = file.readlines()
for i in readfiles:
    data = i.strip()
    print(data)
and have managed to churn out all the data. However, I am not too sure how I can get the 5 rows of data which the problem requires. I am wondering if the .csv file should be converted into an array/list? Hoping someone can help me on this. Thank you.
I can't use pandas or csv for this by the way. =/
import pandas as pd

df = pd.read_csv('#pathtocsv.csv')
df.head()

If you want it as a list:

needed_list = df.head().values.tolist()
First of all, if you want to read a csv file, you can use the pandas library to do it.
import pandas as pd
df = pd.read_csv("path/to/your/file")
print(df.columns[0:5]) # print first 5 column names
print(df.head(5)) # Print first 5 rows
Or if you want to do it without pandas:

rows = []
with open("path/to/file.csv", "r") as fl:
    rows = [x.split(",") for x in fl.read().split("\n")]
print(rows[0][0:5]) # print first 5 column names
print(rows[0:5]) # print first 5 rows
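One caveat with the manual split: if the file ends with a newline, splitting on "\n" yields a trailing empty string, which becomes a bogus row. Stripping the content first avoids that:

with open("path/to/file.csv", "r") as fl:
    # strip() drops the trailing newline so no empty last row is produced
    rows = [x.split(",") for x in fl.read().strip().split("\n")]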
I have two large csv files and I want to compare column1 in csv1 with column1 in csv2. I was able to do this using Python lists: I read csv1 and put column1 into list1, did the same with csv2, and then checked whether each element of list1 is present in list2.
olist = []

def oldList(self):
    for row in self.csvreaderOld:
        self.olist.append(row[1])

nlist = []

def newList(self):
    for row in self.csvreaderNew:
        self.nlist.append(row[1])

def new_list(self):
    return [item for item in self.olist if item not in self.nlist]
The code works but can take a long time to complete. I am trying to see if I can use a dictionary instead and whether that would be faster, so I can check whether keys in dictionary1 exist in dictionary2, but so far I haven't been successful owing to my limited knowledge.
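A general note before the answers below: the reason lists are slow here is that "item not in self.nlist" scans the whole list for every element, which is O(n*m) overall. A set (or dict keys, which behave the same way) gives O(1) lookups. A minimal sketch with hypothetical sample data:

# hypothetical sample data standing in for the two CSV columns
olist = ['a', 'b', 'c']
nlist = ['b', 'c', 'd']

# set membership tests are O(1), avoiding the quadratic list scan
nset = set(nlist)
missing = [item for item in olist if item not in nset]
print(missing)  # ['a']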
If it's a big CSV file or you're planning to continue working with tables, I would suggest doing it with the Pandas module.
To be honest, even if it's a small file, or you're not going to continue working with tables, Pandas is an excellent module.
From what I understand (and I might be mistaken), for reading CSV files, Pandas is one of the quickest libraries to do so.
import pandas as pd
df = pd.read_csv("path to your csv file", use_cols = ["column1", "column2"])
def new_list(df):
return [item for item in df["column2"].values if item not in df["column1"].values]
It's important to use .values when checking for an item in a pandas Series (when you extract a column from a DataFrame, you get a pandas Series).
You could also use list(df["column1"]) or the other methods suggested in How to determine whether a Pandas Column contains a particular value for checking whether a value is contained in a pandas column.
For example:
df = pd.DataFrame({"column1":[1,2,3,4], "column2":[2,3,4,5]})
the data frame would be

   column1  column2
0        1        2
1        2        3
2        3        4
3        4        5
and new_list would return [5]
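Not part of the original answer, but worth noting: pandas can express the same check in vectorized form with Series.isin, which avoids the Python-level loop:

import pandas as pd

df = pd.DataFrame({"column1": [1, 2, 3, 4], "column2": [2, 3, 4, 5]})
# ~isin(...) keeps the column2 values that never appear in column1
new_items = df.loc[~df["column2"].isin(df["column1"]), "column2"].tolist()
print(new_items)  # [5]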
You can read both files into objects and compare in a single loop.
Here is a short code snippet for the idea (not a class implementation):

with open('oldFile.csv', 'r') as fsOld, open('newFile.csv', 'r') as fsNew:
    fsLinesOld = fsOld.readlines()
    fsLinesNew = fsNew.readlines()

outList = []
# assumes both files have the same number of lines, in the same order
for lineOld, lineNew in zip(fsLinesOld, fsLinesNew):
    if lineOld == lineNew:
        outList.append(lineOld)
First of all, change the way you read the CSV files: if you want just one column, mention it in usecols, like this
df = pd.read_csv("sample_file.csv", usecols=col_list)
And second, if you are not comparing row to row, you can use a set difference, like this
set(df.col.to_list()).difference(set(df2.col.to_list()))
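A tiny worked example of that set difference, with made-up data just to show the shape of the result:

import pandas as pd

df = pd.DataFrame({"col": [1, 2, 3]})
df2 = pd.DataFrame({"col": [2, 3, 4]})
# values present in df.col but missing from df2.col
print(set(df.col.to_list()).difference(set(df2.col.to_list())))  # {1}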
I have a csv file that stores the following information on each line; name, phone number, class time, duration of the class. I am trying to store only the phone number from each line of the csv file into a list. I am currently trying to get it to work using regex, but if there are better suggestions, I am all ears. I am relatively new to coding python, so any other advice would be much appreciated.
import re

def get_numbers():
    file = open("students.csv")
    regex = r"(\d+)"
    for row in file:
        if row:
            result = re.search(regex, row)
            print(result[0])
This is a sample of what each line in the csv file looks like:
James Example,611-544-3091,8:00pm,1hr
Carl Example,900-122-818,12:15pm,30 mins
There are quite a few ways to do this.
1
Pandas offers a very elegant solution. You can read the csv file, and extract only the phone numbers. Here is how to do it.
import pandas as pd
df = pd.read_csv('file.csv', names=['name', 'phone number', 'class time', 'duration'])
phno = df['phone number'].tolist()
What this essentially does is take your entire data and make it into a table. Each line of your file corresponds to one row, and each entry in a line corresponds to a column entry. Once you make it into a table using the read_csv function, you can then extract any column. You require the 'phone number' column, so you pick it up using df['phone number'] and convert it into a list.
2
If you do not want to use pandas, here is another method:

with open('students.csv') as file:
    for row in file:
        phno = row.split(',')[1]
        print(phno)
        # or append it to some master list if you wish
The best way would be to use pandas
df = pd.read_csv("path/to/file.csv")
https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html
It also lets you easily manipulate rows and a lot more. And there is plenty of tutorials etc. on the web.
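To connect that to the question, here is a sketch of extracting the phone numbers; header=None and the names list are assumptions, since the sample file has no header row:

import pandas as pd

# header=None because the sample file has no header row
df = pd.read_csv("students.csv", header=None,
                 names=["name", "number", "time", "duration"])
numbers = df["number"].tolist()
print(numbers)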
The Python standard library has a csv module which is intended for exactly this; you can use either csv.reader or csv.DictReader:
import csv
def get_numbers():
    with open("students.csv") as fh:
        for row in csv.reader(fh):
            if not row:
                # row is empty; skip
                continue
            # unpack the row into four variables
            name, number, time, duration = row
            print(number)
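And a sketch of the csv.DictReader variant the answer mentions; the fieldnames list is an assumption, needed because the sample file has no header row:

import csv

def get_numbers_dict():
    with open("students.csv") as fh:
        reader = csv.DictReader(fh, fieldnames=["name", "number", "time", "duration"])
        for row in reader:
            print(row["number"])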
I have a CSV file, and I want to delete some rows of it based on the values of one of the columns. I do not know the code to delete specific rows of a CSV file that has been read into a pandas.core.frame.DataFrame.
I read related questions, and I found that people suggest writing every acceptable line to a new file. I do not want to do that. What I want is:
1) to delete the rows whose index (row number) I know
or
2) to make a new CSV in memory (without writing it out and reading it back)
Here's an example of what you can do with pandas. If you need more detail, you might find Indexing and Selecting Data a helpful resource.
import pandas as pd
from io import StringIO
mystr = StringIO("""speed,time,date
12,22:05,1
15,22:10,1
13,22:15,1""")
# replace mystr with 'file.csv'
df = pd.read_csv(mystr)
# convert time column to timedelta, assuming mm:ss
df['time'] = pd.to_timedelta('00:'+df['time'])
# filter for >= 22:10, i.e. second item
df = df[df['time'] >= df['time'].loc[1]]
print(df)
   speed      time  date
1     15  00:22:10     1
2     13  00:22:15     1
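For the question's first option, deleting rows whose index you already know, pandas' drop method does it in memory without writing a new file. A minimal sketch with a made-up frame:

import pandas as pd

df = pd.DataFrame({"speed": [12, 15, 13], "date": [1, 1, 1]})
# drop by index label; with the default RangeIndex these are the row numbers
df = df.drop([0, 2])
print(df)  # only the middle row remains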
So I've currently got a dataset that has a column called 'logid' which consists of 4 digit numbers. I have about 200k rows in my csv files and I would like to count each unique logid and output something like this:
Logid | # of occurrences for each unique id. So it might be 1000 | 10, meaning that the logid 1000 is seen 10 times in the csv file column 'logid'. The separator | is not necessary, just making it easier for you guys to read. This is my code currently:
import pandas as pd
import os, sys
import glob
count = 0
path = "C:\\Users\\cam19\\Desktop\\New folder\\*.csv"
for fname in glob.glob(path):
    df = pd.read_csv(fname, dtype=None, names=['my_data'], low_memory=False)
    counts = df['my_data'].value_counts()
counts
Using this I get a weird output that I don't quite understand:
4 16463
10013 490
pserverno 1
Name: my_data, dtype: int64
I know I am doing something wrong in the last line
counts = df['my_data'].value_counts()
but I am not too sure what. For reference, the values I am extracting are from column C in the excel file (so I guess that's column 3?). Thanks in advance!
OK, from my understanding, I think the csv file may look like this:
row1,row1,row1
row2,row2,row2
row3,row3,row3
logid,header1,header2
1000,a,b
1001,c,d
1000,e,f
1001,g,h
I have handled this format of csv file like this:

# skip the first three rows
df = pd.read_csv("file_name.csv", skiprows=3)
print(df['logid'].value_counts())
And the output look like this
1001 2
1000 2
Hope this will help.
update 1
df = pd.read_csv(fname, dtype=None, names=['my_data'], low_memory=False)
In this line the parameter names=['my_data'] creates a new header for the data frame. As your csv file already has a header row, you can skip this parameter. And since the real header comes after the first three rows, you can skip those with skiprows=3. One last thing: you are reading every csv file in the given path, so be conscious that all of the csv files have the same format. Happy coding.
I think you need to create one big DataFrame by appending all the dfs to a list and then using concat:
dfs = []
path = "C:\\Users\\cam19\\Desktop\\New folder\\*.csv"
for fname in glob.glob(path):
    df = pd.read_csv(fname, dtype=None, usecols=['logid'], low_memory=False)
    dfs.append(df)
df = pd.concat(dfs)
Then use value_counts - the output is a Series, so for a 2 column DataFrame you need rename_axis with reset_index:
counts = df['logid'].value_counts().rename_axis('logid').reset_index(name='count')
counts
Or use groupby and aggregate size:
counts = df.groupby('logid').size().reset_index(name='count')
counts
You may try this:
counts = df['logid'].value_counts()
Now the "counts" should give you the count of each value.
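A quick illustration of what value_counts returns, with made-up logids just to show the shape:

import pandas as pd

df = pd.DataFrame({"logid": [1000, 1001, 1000, 1000]})
print(df["logid"].value_counts())
# 1000    3
# 1001    1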