Counting unique id's in csv file using Pandas (python) - python

I have a dataset with a column called 'logid' that consists of 4-digit numbers. There are about 200k rows across my csv files, and I would like to count how often each unique logid occurs and output something like this:
Logid | number of occurrences for each unique id. For example, 1000 | 10 would mean that logid 1000 appears 10 times in the 'logid' column of the csv file. The | separator is not necessary; it is only there to make this easier to read. This is my current code:
import pandas as pd
import os, sys
import glob
count = 0
path = "C:\\Users\\cam19\\Desktop\\New folder\\*.csv"
for fname in glob.glob(path):
    df = pd.read_csv(fname, dtype=None, names=['my_data'], low_memory=False)
    counts = df['my_data'].value_counts()
counts
Using this I get a weird output that I don't quite understand:
4 16463
10013 490
pserverno 1
Name: my_data, dtype: int64
I know I am doing something wrong in the last line
counts = df['my_data'].value_counts()
but I am not sure what. For reference, the values I am extracting are from column C in the Excel file (so I guess that's column 3?). Thanks in advance!

OK. From my understanding, I think your csv file may look like this:
row1,row1,row1
row2,row2,row2
row3,row3,row3
logid,header1,header2
1000,a,b
1001,c,d
1000,e,f
1001,g,h
I tested the following with a csv file in this format:
# skipping the first three row
df = pd.read_csv("file_name.csv", skiprows=3)
print(df['logid'].value_counts())
And the output looks like this:
1001 2
1000 2
Hope this helps.
Update 1
df = pd.read_csv(fname, dtype=None, names=['my_data'], low_memory=False)
In this line, the parameter names=['my_data'] assigns a new header to the data frame, so the file's own header row is read as data. Since your csv file already has a header row, you can drop this parameter. And since the header you want comes after three other rows, you can skip the first three rows. One last thing: you are reading every csv file in the given path, so make sure all of the csv files have the same format. Happy coding.
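To make the difference concrete, here is a minimal sketch that reads the same text twice. It assumes the file layout guessed above; the in-memory string and the column names c1, c2, c3 are made up for illustration and are not from the original question.

```python
import io
import pandas as pd

# Hypothetical file contents: three junk rows, then the real header
csv_text = """row1,row1,row1
row2,row2,row2
row3,row3,row3
logid,header1,header2
1000,a,b
1001,c,d
1000,e,f
1001,g,h
"""

# With names=[...] and no header argument, every line is treated as data,
# so the header text "logid" is counted as if it were a value
as_data = pd.read_csv(io.StringIO(csv_text), names=["c1", "c2", "c3"])
print(as_data["c1"].value_counts())

# Skipping the junk rows lets pandas use the fourth line as the header
df = pd.read_csv(io.StringIO(csv_text), skiprows=3)
counts = df["logid"].value_counts()
print(counts)
```

In the first read the ids are strings mixed with header text, which is the kind of output the question shows; in the second they are integers keyed by the 'logid' column.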

I think you need to create one big DataFrame by appending every df to a list and then calling concat:
dfs = []
path = "C:\\Users\\cam19\\Desktop\\New folder\\*.csv"
for fname in glob.glob(path):
    df = pd.read_csv(fname, dtype=None, usecols=['logid'], low_memory=False)
    dfs.append(df)
df = pd.concat(dfs)
Then use value_counts, which returns a Series. To get a two-column DataFrame, add rename_axis and reset_index:
counts = df['logid'].value_counts().rename_axis('logid').reset_index(name='count')
counts
Or use groupby and aggregate with size:
counts = df.groupby('logid').size().reset_index(name='count')
counts
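A self-contained sketch of the same idea, assuming each file has a 'logid' header. The two in-memory "files" and their contents are invented stand-ins so the example runs without anything on disk.

```python
import io
import pandas as pd

# Stand-ins for the csv files on disk (hypothetical contents)
fake_files = [
    io.StringIO("logid,other\n1000,a\n1001,b\n1000,c\n"),
    io.StringIO("logid,other\n1001,d\n1002,e\n"),
]

# Read only the column of interest from each file, then stack the pieces
frames = [pd.read_csv(f, usecols=["logid"]) for f in fake_files]
combined = pd.concat(frames, ignore_index=True)

# One row per unique id, with its number of occurrences across all files
counts = combined.groupby("logid").size().reset_index(name="count")
print(counts)
```

Counting after the concat matters: calling value_counts inside the loop would only keep the counts of the last file read.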

You may try this:
counts = df['logid'].value_counts()
Now "counts" holds the number of occurrences of each value.

Related

Creating a new dataframe to contain a section of 1 column from multiple csv files in Python

So I am trying to create a new dataframe that includes some data from 300+ csv files.
Each file contains up to 200,000 rows of data, and I am only interested in one of the columns within each file (the same column for each file).
I am trying to combine these columns into one dataframe, where column 6 from csv 1 would be in the 1st column of the new dataframe, column 6 from csv 2 would be in the 2nd column of the new dataframe, and so on up until the 315th csv file.
I don't need all 200,000 rows of data to be extracted, but I am unsure how to extract just 2,000 rows from the middle section of the data (the number of rows varies between files, so the exact same rows aren't necessary, as long as it is the middle 2,000).
Any help in extracting the 2000 rows from each file to populate different columns in the new dataframe would be greatly appreciated.
So far, I have manipulated the data to only contain the relevant column for each file. This displays all the rows of data in the column for each file individually.
I tried to use the iloc function to reduce this to 2000 rows but it did not display any actual data in the output.
I am unsure how I would now extract this data into a single dataframe that contains all the columns.
import pandas as pd
import os
import glob
import itertools
#glob to get all csv files
path = os.getcwd()
csv_files = glob.glob(os.path.join('filepath/', "*.csv"))
#loop list of csv files
for f in csv_files:
    df = pd.read_csv(f, header=None)
    df.rename(columns={6: 'AE'}, inplace=True)
    new_df = df.filter(['AE'])
    print('Location:', f)
    print('File Name:', f.split("\\")[-1])
    print('Content:')
    display(new_df)
    print()
Based on your description, I am inferring that you have a number of different files in csv format, each of which has at least 2000 lines and 6 columns. You want to take the data only from the 6th column of each file and only for the middle 2000 records in each file and to put all of those blocks of 2000 records into a new dataframe, with a column that in some way identifies which file the block came from.
You can read each file using pandas, as you have done, and then you need to use loc, as one of the commenters said, to select the 2000 records you want to keep. If you save each of those blocks of records in a separate dataframe you can then use the pandas concat method to join them all together into different columns of a new dataframe.
Here is some code that I hope will be self-explanatory. I have assumed that you want the 6th column, which is the one with index 5 in pandas because we start counting from 0. I have also used usecols to keep only the 6th column, and I rename the column to an index number based on the order in which the files are being read. You would need to change this for your own choice of column naming.
I choose the middle 2000 records by defining the starting point as record x, say, so that x + 2000 + x = total number of records, therefore x=(total number of records) / 2 - 1000. This might not be exactly how you want to define the middle 2000 records, so you could change this.
df_middles is a list to which we append every new dataframe of the new file's middle 2000 records. We use pd.concat at the end to put all the columns into a new dataframe.
import os
import glob
import pandas as pd
# glob to get all csv files
path = os.getcwd()
csv_files = glob.glob(os.path.join("filepath/", "*.csv"))
df_middles = []
# loop list of csv files
for idx, f in enumerate(csv_files, 1):
    # only keep the 6th column (index 5)
    df = pd.read_csv(f, header=None, usecols=[5])
    colname = f"column_{idx}"
    df.rename(columns={5: colname}, inplace=True)
    number_of_lines = len(df)
    if number_of_lines < 2000:
        raise IOError(f"Not enough lines in the input file: {f}")
    middle_range_start = int(number_of_lines / 2) - 1000
    middle_range_end = middle_range_start + 1999
    df_middle = df.loc[middle_range_start:middle_range_end].reset_index(drop=True)
    df_middles.append(df_middle)
df_final = pd.concat(df_middles, axis="columns")
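The slicing step can also be sketched with iloc, which is position based and excludes the end point, so no "minus one" is needed. The helper name middle_rows and the toy frames below are hypothetical, and the block size is shrunk to 4 so the result is easy to check by eye.

```python
import pandas as pd

def middle_rows(df: pd.DataFrame, n: int) -> pd.DataFrame:
    """Return the n rows centred in df, with a fresh 0..n-1 index."""
    if len(df) < n:
        raise ValueError(f"need at least {n} rows, got {len(df)}")
    start = (len(df) - n) // 2
    # iloc slices by position and excludes the end, unlike loc
    return df.iloc[start:start + n].reset_index(drop=True)

# Two toy "files" of different lengths (made-up data)
a = pd.DataFrame({"column_1": range(10)})
b = pd.DataFrame({"column_2": range(100, 107)})

# Resetting the index in middle_rows lets concat line the blocks up side by side
wide = pd.concat([middle_rows(a, 4), middle_rows(b, 4)], axis="columns")
print(wide)
```

Without the reset_index call, concat would align on the original row labels and fill the non-overlapping positions with NaN.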

Readline to Array or List output

I am new to Python and was recently given a problem to solve.
Briefly, I have a .csv file consisting of some data. I am supposed to read the .csv file and print out the first 5 column header names, followed by 5 rows of the data in the following format shown in the picture.
[image: Results]
Currently, I have written the code up to:
readfiles = file.readlines()
for i in readfiles:
    data = i.strip()
    print(data)
and have managed to print out all the data. However, I am not sure how to get just the 5 rows of data required by the problem. I am wondering whether the .csv file should be converted into an array/list. Hoping someone can help me with this. Thank you.
I can't use pandas or csv for this by the way. =/
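import pandas as pd
df = pd.read_csv('pathtocsv.csv')
df.head()
If you want it as a list:
# a DataFrame has no tolist(); go through .values to get a list of rows
needed_list = df.head().values.tolist()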
First of all, if you want to read a csv file, you can use the pandas library to do it.
import pandas as pd
df = pd.read_csv("path/to/your/file")
print(df.columns[0:5]) # print first 5 column names
print(df.head(5)) # Print first 5 rows
Or, if you want to do it without pandas:
rows = []
with open("path/to/file.csv", "r") as fl:
    # splitlines() avoids a trailing empty row when the file ends with a newline
    rows = [x.split(",") for x in fl.read().splitlines()]
print(rows[0][0:5]) # print first 5 column names
print(rows[1:6]) # print first 5 data rows (row 0 is the header)
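If the file is large, a variant that stops reading after the lines it needs may be preferable. This is only a sketch: the helper name first_rows is hypothetical, the StringIO object stands in for an opened file, and the plain split(",") assumes no field contains a quoted comma.

```python
import io
from itertools import islice

# Stand-in for open("path/to/file.csv"); the contents are made up
fl = io.StringIO(
    "a,b,c,d,e,f\n"
    "1,2,3,4,5,6\n"
    "7,8,9,10,11,12\n"
    "13,14,15,16,17,18\n"
    "19,20,21,22,23,24\n"
    "25,26,27,28,29,30\n"
    "31,32,33,34,35,36\n"
)

def first_rows(lines, n_rows=5, n_cols=5):
    """Return (header, rows) limited to n_cols columns and n_rows data rows."""
    # islice stops after the header plus n_rows lines, so the rest is never read
    parsed = [
        line.rstrip("\n").split(",")[:n_cols]
        for line in islice(lines, n_rows + 1)
    ]
    return parsed[0], parsed[1:]

header, rows = first_rows(fl)
print(header)
for row in rows:
    print(row)
```

Each row comes back as a list of strings, so this also answers the array/list part of the question.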

Python Pandas dont start on the first column

I loaded a CSV file using Python pandas and want to drop every second column. I can't access the file from the first to the last column. My CSV file has only one row with no captions. The original file has about 1000 columns; for testing I use 12 columns. How can I access the columns from the first to the last?
I tried to drop the first column by index. Later I want to iterate through the columns. I expect an index like 0 to 11 or 1 to 12. Here is my code:
import pandas as pd
df = pd.read_csv("test.csv", index_col=0)
print(len(df.columns)) #returns 11 - expected: 12
df.drop(df.columns[0], axis=1)
df.to_csv('output.csv')
Code works, but with index 0 it drops the second column instead of the first and index 2 drops the fourth one and so on...
Hope you can help me
I've edited my code. Not pretty, but it works:
import pandas as pd
fileName = 'test.csv'
dummy = pd.read_csv(fileName)
length = len(dummy.columns)
del dummy
df = pd.read_csv(fileName, usecols=[i for i in range(length) if i%2==0])
df.to_csv('output.csv', index=False)
Thank you for your answers
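For reference, the 11-column count in the first attempt comes from index_col=0, which turns the first column into the row index. A possible single-read alternative is sketched below, assuming a one-row file without a header; the in-memory text is made up for illustration.

```python
import io
import pandas as pd

# One row, twelve columns, no header line (made-up data)
csv_text = "0,1,2,3,4,5,6,7,8,9,10,11\n"

# header=None keeps the row as data; leaving out index_col keeps all 12 columns
df = pd.read_csv(io.StringIO(csv_text), header=None)
print(len(df.columns))

# Keep columns 0, 2, 4, ... which drops every second column
kept = df.iloc[:, ::2]

# With no path argument, to_csv returns the csv text instead of writing a file
out = kept.to_csv(index=False, header=False)
print(out)
```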

CSV manipulation | Searching columns | Checking rules

Help is greatly appreciated!
I have a CSV that looks like this:
[image: CSV example]
I am writing a program to check that each column holds the correct data type. For example:
Column 1 - Must have valid time stamp
Column 2 - Must hold the value 2
Column 4 - Must be consecutive (If not how many packets missing)
Column 5/6 - Calculation done on both values and outcome must match the inputted value
The columns can be in different positions.
I have tried using the pandas module to give each column an 'id':
import pandas as pd
fields = ['star_name', 'ra']
df = pd.read_csv('data.csv', skipinitialspace=True, usecols=fields)
print(df.keys())
print(df.star_name)
However, when doing the checks on the data it seems to get confused. What would be the next best approach to do something like this?
I have really been killing myself over this and any help would be appreciated.
Thank you!
Try using the 'csv' module.
Example
import csv
with open('data.csv', 'r') as f:
    # The first line of the file is assumed to contain the column names
    reader = csv.DictReader(f)
    # Read one row at a time
    # If you need to compare with the previous row, just store that in a variable(s)
    prev_title_4_value = 0
    for row in reader:
        print(row['Title 1'], row['Title 3'])
        # Sample to illustrate how column 4 values can be compared
        curr_title_4_value = int(row['Title 4'])
        if (curr_title_4_value - prev_title_4_value) != 1:
            print('Values are not consecutive')
        prev_title_4_value = curr_title_4_value
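The question also asks how many packets are missing when the values are not consecutive. A minimal sketch of that count follows; the column name "seq", the helper count_missing, and the sample data are all hypothetical.

```python
import csv
import io

# Made-up data; "seq" is a hypothetical column name for the packet counter
data = io.StringIO(
    "time,seq\n"
    "10:00:00,1\n"
    "10:00:01,2\n"
    "10:00:02,5\n"
    "10:00:03,6\n"
)

def count_missing(rows, column):
    """Return how many values were skipped between consecutive rows."""
    missing = 0
    prev = None
    for row in rows:
        curr = int(row[column])
        # A jump from 2 to 5 means 3 and 4 never arrived
        if prev is not None and curr - prev > 1:
            missing += curr - prev - 1
        prev = curr
    return missing

missing = count_missing(csv.DictReader(data), "seq")
print(missing)
```

Starting prev at None avoids flagging the first row when the counter does not begin at 1, and because DictReader looks columns up by name, the check keeps working when the columns change position.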

Select 2 ranges of columns to load - read_csv in pandas

I'm reading in an excel .csv file using pandas.read_csv(). I want to read in 2 separate column ranges of the excel spreadsheet, e.g. columns A:D AND H:J, to appear in the final DataFrame. I know I can do it once the file has been loaded in using indexing, but can I specify 2 ranges of columns to load in?
I've tried something like this....
usecols=[0:3,7:9]
I know I could list each column number individually, e.g.
usecols=[0,1,2,3,7,8,9]
but I have simplified the file in question, in my real file I have a large number of rows so I need to be able to select 2 large ranges to read in...
I'm not sure if there's an official-pretty-pandaic-way to do it with pandas.
But you can do it this way:
# say you want to extract 2 ranges of columns
# columns 5 to 14
# and columns 30 to 66
import pandas as pd
range1 = [i for i in range(5,15)]
range2 = [i for i in range(30,67)]
usecols = range1 + range2
file_name = 'path/to/csv/file.csv'
df = pd.read_csv(file_name, usecols=usecols)
As @jezrael notes, you can use numpy.r_ to do this in a more Pythonic and legible way:
import pandas as pd
import numpy as np
file_name = 'path/to/csv/file.csv'
# slices exclude the end point: 0:4 selects columns A:D, 7:10 selects H:J
df = pd.read_csv(file_name, usecols=np.r_[0:4, 7:10])
Gotchas
Watch out when using this in combination with names: make sure you have allowed for the extra column pandas adds for the index. Also note that np.r_ slices exclude the end point, so 0:3 yields 0, 1, 2 (3 items); use 0:4 to get 4 items.
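A quick way to check which positions a given np.r_ expression selects is to print it before passing it to read_csv. The sketch below assumes a ten-column file with letter headers; the data is made up for illustration.

```python
import io
import numpy as np
import pandas as pd

# np.r_ follows Python slice rules, so the end of each range is excluded
cols = np.r_[0:4, 7:10]
print(cols.tolist())

# Made-up file with columns A..J
csv_text = "A,B,C,D,E,F,G,H,I,J\n0,1,2,3,4,5,6,7,8,9\n"
df = pd.read_csv(io.StringIO(csv_text), usecols=cols)
print(list(df.columns))
```

Here 0:4 covers spreadsheet columns A to D and 7:10 covers H to J, matching the explicit list [0, 1, 2, 3, 7, 8, 9] from the question.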
