How to count the number of entries in a particular column - Python

I have an Excel file with thousands of entries, saved as a CSV. I want to count the number of entries in the first column.

import csv

with open('data.csv') as f:
    reader = csv.reader(f)
    annotated_data = [r for r in reader]

So now I want to count the entries. I tried doing:
a = 0
b = 0
c = 0
d = 0
e = 0
for i in annotated_data:
    if annotated_data[0][i] == A:
        a = a + 1
    if annotated_data[0][i] == B:
        b = b + 1
    if annotated_data[0][i] == C:
        # ...continue until E
print("Total number of A:" + a)  # ...continue until E
But it told me "list indices must be integers or slices, not list". So I tried doing:

for i in range(annotated_data)

and it told me "'list' object cannot be interpreted as an integer". I'm not sure what else to do; any help is appreciated.

Iterating through a list gives you items in the list, not their indices.
So, do this:
for row in annotated_data:
    first_cell = row[0]
If you really wanted to have the indices, you would have to pass a number to range, rather than the list, i.e.:
range(len(annotated_data))
But I would not recommend doing that. It only makes things slower, less readable, and incompatible with some container types.
If you really needed both indices and items, you could do this:
for row_number, row in enumerate(annotated_data):
    first_cell = row[0]
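If the end goal is tallying how often each letter appears in the first column, collections.Counter can replace the five separate counters entirely (a sketch; the sample data here is hypothetical and assumes the first cell of each row holds a single letter):

```python
from collections import Counter

# Stand-in for annotated_data as it would come back from csv.reader
annotated_data = [["A", "x"], ["B", "y"], ["A", "z"], ["C", "w"], ["A", "v"]]

# Count how many times each first-column value occurs
counts = Counter(row[0] for row in annotated_data)
for letter in "ABCDE":
    print("Total number of %s: %d" % (letter, counts[letter]))
```

A Counter returns 0 for missing keys, so letters that never appear are reported as 0 without any extra checks.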

As a quick fix, you may want to try
if i[0] == A:
    a += 1
etc. Or if you are looking for the literal string 'A', then:
if i[0] == 'A':

Install pandas using pip install pandas. Then you could do something like this:

import pandas as pd

df = pd.read_csv('path to file.csv')
print(len(df) + 1)

Here len(df) counts the data rows; the + 1 accounts for the header row that read_csv consumes.
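If what is actually needed is a per-value tally of the first column rather than a total row count, pandas can do that directly with value_counts (a sketch, with in-memory data standing in for the real CSV file):

```python
import io

import pandas as pd

# Hypothetical contents standing in for data.csv
csv_text = "label,score\nA,1\nB,2\nA,3\nC,4\nA,5\n"
df = pd.read_csv(io.StringIO(csv_text))

# Occurrences of each distinct value in the first column
counts = df.iloc[:, 0].value_counts()
print(counts["A"])  # 3
```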

Related

How to separate different input formats from the same text file with Python

I'm new to programming and Python, and I'm looking for a way to distinguish between two input formats in the same input text file. For example, let's say I have an input file like the one below, where values are comma-separated:
5
Washington,A,10
New York,B,20
Seattle,C,30
Boston,B,20
Atlanta,D,50
2
New York,5
Boston,10
Where the format is N followed by N lines of Data1, and M followed by M lines of Data2. I tried opening the file, reading it line by line and storing it into one single list, but I'm not sure how to go about to produce 2 lists for Data1 and Data2, such that I would get:
Data1 = ["Washington,A,10", "New York,B,20", "Seattle,C,30", "Boston,B,20", "Atlanta,D,50"]
Data2 = ["New York,5", "Boston,10"]
My initial idea was to iterate through the list until I found an integer i, remove the integer from the list and continue for the next i iterations all while storing the subsequent values in a separate list, until I found the next integer and then repeat. However, this would destroy my initial list. Is there a better way to separate the two data formats in different lists?
You could use itertools.islice and a list comprehension:
from itertools import islice
string = """
5
Washington,A,10
New York,B,20
Seattle,C,30
Boston,B,20
Atlanta,D,50
2
New York,5
Boston,10
"""
result = [[x for x in islice(parts, idx + 1, idx + 1 + int(line))]
          for parts in [string.split("\n")]
          for idx, line in enumerate(parts)
          if line.isdigit()]
print(result)
This yields
[['Washington,A,10', 'New York,B,20', 'Seattle,C,30', 'Boston,B,20', 'Atlanta,D,50'], ['New York,5', 'Boston,10']]
For a file, you need to change it to:
with open("testfile.txt", "r") as f:
    result = [[x for x in islice(parts, idx + 1, idx + 1 + int(line))]
              for parts in [f.read().split("\n")]
              for idx, line in enumerate(parts)
              if line.isdigit()]
print(result)
You're definitely on the right track.
If you want to preserve the original list here, you don't actually have to remove integer i; you can just go on to the next item.
Code:
originalData = []
formattedData = []
with open("data.txt", "r") as f:
    f = list(f)
originalData = f
i = 0
while i < len(f):  # Iterate through every line
    try:
        n = int(f[i])  # See if line can be cast to an integer
        originalData[i] = n  # Change string to int in original
        formattedData.append([])
        for j in range(n):
            i += 1
            item = f[i].replace('\n', '')
            originalData[i] = item  # Remove newline char in original
            formattedData[-1].append(item)
    except ValueError:
        print("File has incorrect format")
    i += 1
print(originalData)
print(formattedData)
The following code will produce a list results which is equal to [Data1, Data2].
The code assumes that the number of entries specified is exactly the amount that there is. That means that for a file like this, it will not work.
2
New York,5
Boston,10
Seattle,30
The code:
# get the data from the text file
with open('filename.txt', 'r') as file:
    lines = file.read().splitlines()

results = []
index = 0
while index < len(lines):
    # Find the start and end values.
    start = index + 1
    end = start + int(lines[index])
    # Everything from start up to and excluding the end index gets added.
    results.append(lines[start:end])
    # Update the index.
    index = end

Removing quotes from a 2D array in Python

I am currently trying to execute code that evaluates powers with big exponents without calculating them directly, using their logs instead. I have a file containing 1000 lines; each line contains two integers separated by a comma. I got stuck at the point where I tried to remove the quotes from the array. I tried many ways, none of which worked. Here is my code:
The function split() from myLib takes two arguments: a list, and the number of elements to split the original list into; it splits the list accordingly and appends the smaller lists to the new one.
import math
import myLib

i = 0
record = 0
cmpr = 0
with open("base_exp.txt", "r") as f:
    fArr = f.readlines()
fArr = myLib.split(fArr, 1)
# place to get rid of quotes
print(fArr)
while i < len(fArr):
    cmpr = int(fArr[i][1]) * math.log(int(fArr[i][0]))
    if cmpr > record:
        record = cmpr
        print(record)
    i = i + 1
This is how my Array looks like:
[['519432,525806\n'], ['632382,518061\n'], ... ['172115,573985\n'], ['13846,725685\n']]
I tried to find a way around the 2d array and tried:
import math

i = 0
record = 0
cmpr = 0
with open("base_exp.txt", "r") as f:
    fArr = f.readlines()
# fArr = myLib.split(fArr, 1)
fArr = [x.replace("'", '') for x in fArr]
print(fArr)
while i < len(fArr):
    cmpr = int(fArr[i][1]) * math.log(int(fArr[i][0]))
    if cmpr > record:
        record = cmpr
        print(i)
    i = i + 1
But output looked like this:
['519432,525806\n', '632382,518061\n', '78864,613712\n', ...
And the numbers in their current state cannot be parsed as integers or floats, so this isn't working either:
[int(i) for i in lst]
Expected output for the array itself would look like this, so i can pick one of the numbers and work with it:
[[519432,525806], [632382,518061], [78864,613712]...
I would really appreciate your help, since I'm still very new to Python and programming in general.
Thank you for your time.
You can avoid all of your problems by simply using numpy's convenient loadtxt function:
import numpy as np
arr = np.loadtxt('p099_base_exp.txt', delimiter=',')
arr
array([[519432., 525806.],
[632382., 518061.],
[ 78864., 613712.],
...,
[325361., 545187.],
[172115., 573985.],
[ 13846., 725685.]])
If you need a one-dimensional array:
arr.flatten()
# array([519432., 525806., 632382., ..., 573985., 13846., 725685.])
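One caveat: loadtxt returns floats by default, while the expected output in the question uses plain integers; passing dtype=int fixes that (a sketch, with an in-memory file standing in for the real path):

```python
import io

import numpy as np

# Two sample lines standing in for base_exp.txt
data = io.StringIO("519432,525806\n632382,518061\n")
arr = np.loadtxt(data, delimiter=",", dtype=int)
print(arr.tolist())  # [[519432, 525806], [632382, 518061]]
```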
This is your missing piece:
fArr = [[int(num) for num in line.rstrip("\n").split(",")] for line in fArr]
Here, rstrip("\n") removes the trailing \n from each line, and split(",") then turns each string into a list of number strings; calling int() on each element converts them to the int data type.
Below code should do the job if you don't want to import an additional library.
import math

i = 0
record = 0
cmpr = 0
with open("base_exp.txt", "r") as f:
    fArr = f.readlines()
fArr = [[int(num) for num in line.rstrip("\n").split(",")] for line in fArr]
print(fArr)
while i < len(fArr):
    cmpr = fArr[i][1] * math.log(fArr[i][0])
    if cmpr > record:
        record = cmpr
        print(i)
    i = i + 1
This snippet will transform your array into a 1D array of integers:
from itertools import chain
arr = [['519432,525806\n'], ['632382,518061\n']]
new_arr = [int(i.strip()) for i in chain.from_iterable(i[0].split(',') for i in arr)]
print(new_arr)
Prints:
[519432, 525806, 632382, 518061]
For 2D output you can use this:
arr = [['519432,525806\n'], ['632382,518061\n']]
new_arr = [[int(i) for i in v] for v in (i[0].split(',') for i in arr)]
print(new_arr)
This prints:
[[519432, 525806], [632382, 518061]]
new_list = []
a = ['519432,525806\n', '632382,518061\n', '78864,613712\n']
for i in a:
    new_list.append(list(map(int, i.split(","))))
print(new_list)
Output:
[[519432, 525806], [632382, 518061], [78864, 613712]]
In order to flatten new_list (note that reduce returns a new value rather than modifying the list in place):

from functools import reduce

flat = reduce(lambda x, y: x + y, new_list)
print(flat)
Output:
[519432, 525806, 632382, 518061, 78864, 613712]

Using the "with open" and "csv.reader" functions

I've got a problem with this code:

import csv

with open('gios-pjp-data.csv', 'r') as data:
    l = []
    reader = csv.reader(data, delimiter=';')
    next(reader)
    next(reader)  # I need to skip 2 lines here and don't know another way to do it
    l.append(  # here is my problem, which I describe below
This file contains about 350 lines with 4 columns, each built like this:
Date ; float number ; float number ; float number
Something like this:
2017-01-01;56.7;167.2;236.9
Now, I don't know how to build a function that would append the first and third float numbers to the list, on the condition that the value is > 200.
Do you have any suggestions?
A list comprehension will do if you don't have too many items in the file:

l = [(x[1], x[3]) for x in reader if float(x[1]) > 200]

Note that csv gives you strings, so the values need to be converted with float() before comparing against 200. Or a similar function that yields each line, if you have a huge number of entries:

def getitems():
    for x in reader:
        if float(x[1]) > 200:
            yield x[1], x[3]

l = getitems()  # this is now a generator, more memory efficient
l = list(l)     # now it's a list

Creating a sequence of dataframes

A quick question: I want to know if there is a way to create a sequence of data frames by putting a variable inside the name of each data frame. For example:
df_0 = pd.read_csv(file1, sep=',')
b = 0
x = 1
while (b == 0):
    df_+str(x) = pd.merge(df_+str(x-1), Source, left_on='R_Key', right_on='S_Key', how='inner')
    if Final_+str(x).empty != 'True':
        x = x + 1
    else:
        b = b + 1
Now when executed, this returns "can't assign to operator" for df_+str(x). Any idea how to fix this?
This is the right time to use a list (a sequence type in Python), so you can refer to exactly as many data frames as you need.
dfs = []
dfs.append(pd.read_csv(file1, sep=','))  # It is now dfs[0]
b = 0
x = 1
while (b == 0):
    dfs.append(pd.merge(dfs[x-1],
                        Source, left_on='R_Key',
                        right_on='S_Key', how='inner'))
    if Final[x].empty != 'True':
        x = x + 1
    else:
        b = b + 1
Now, you never define Final. You'll need to use the same trick there.
Not sure why you want to do this, but I think a clearer and more logical way is to create a dictionary, with dataframe name strings as keys and the generated dataframes as values.
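A minimal sketch of that dictionary approach, with toy frames standing in for the real file1 and Source data (the column names R_Key and S_Key are taken from the question):

```python
import pandas as pd

# Toy stand-ins for the real inputs
source = pd.DataFrame({"S_Key": [1, 2], "val": ["x", "y"]})
dfs = {"df_0": pd.DataFrame({"R_Key": [1, 2, 3]})}

# Each new frame gets a constructed dictionary key
# instead of a constructed variable name
dfs["df_1"] = pd.merge(dfs["df_0"], source,
                       left_on="R_Key", right_on="S_Key", how="inner")
print(len(dfs["df_1"]))  # 2 of the 3 keys matched
```

Because the keys are ordinary strings, "df_" + str(x) works as a lookup, which is exactly what the original code tried to do with variable names.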

Using the index of items in a list to compare the order they come in the list

I'm trying to compare the order in which items appear in a list by assigning their positions to names. I've used 'if' statements to attempt this, but it doesn't seem to be working. I think I need to find the index of each item in the list and then work out which of the two is greater.
Here is my code:
import re, sys

f = open('findallEX.txt', 'r')
for line in f.readlines():
    match = re.findall('[A-Z]+', line)

ii = 0
if index in match: 'VERB'
    verbindex = ii
    ii = ii + 1
ii = 0
if index in match: 'OBJ'
    objindex = ii
    ii = ii + 1
I've used the name 'index' just to demonstrate what I think it needs to be. Is there any way to do this? Thanks!
Are you looking for something like this?
>>> data = ['VERB', 'SUBJ', 'OBJ']
>>> for n, i in enumerate(data):
...     print(n, i)
...
0 VERB
1 SUBJ
2 OBJ
To find the index of an item in the list:
>>> ['VERB', 'SUBJ', 'OBJ'].index('SUBJ')
1
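Putting that to use for the original goal: once both positions are obtained with index, comparing them tells you which tag comes first (a sketch, assuming both 'VERB' and 'OBJ' appear in the matches):

```python
# Hypothetical list of matches from re.findall
match = ["SUBJ", "VERB", "OBJ"]

verbindex = match.index("VERB")
objindex = match.index("OBJ")

if verbindex < objindex:
    print("VERB comes before OBJ")
else:
    print("OBJ comes before VERB")
```

Note that index raises a ValueError when the item is missing, so a real script would check membership with "in" first.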
