Access a specific value in a pandas DataFrame - Python

I'm just starting out with Python and am struggling to extract a value from the first column of the last row of my dataframe.
So let's say I have a .csv file with 3 columns:
id,name,country
1,bob,USA
2,john,Brazil
3,brian,austria
I'm trying to extract '3' from the id column (the last row's id value):
import csv

fileName = open('data.csv')
reader = csv.reader(fileName, delimiter=',')
count = 0
for row in reader:
    count = count + 1
I'm able to get the row count, but I'm unsure how to get the value from that particular column.

This should do the job:
import csv

fileName = open('123.csv')
reader = csv.reader(fileName, delimiter=',')
count = 0
for row in reader:
    # the header line is row 0, so the last data row is row 3
    if count == 3:
        print(row[0])
    count = count + 1
But it's better to import pandas and convert your CSV file to a DataFrame, like this:
import csv
import pandas as pd

fileName = open('123.csv')
reader = csv.reader(fileName, delimiter=',')
df = pd.DataFrame(reader)
print(df.loc[3][0])
That makes it easier to grab whatever element you want.
Using loc, you can access any element by its row label and column label (here the default integer labels). For example, the element '3' sits in row 3, column 0, so you grab it with df.loc[3][0].
If you don't have pandas installed, install it from the command prompt with:
pip install pandas
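A shorter variant of the same idea (a minimal sketch, assuming the file from the question is saved as data.csv): pd.read_csv consumes the header line itself, so the data rows start at position 0.
import pandas as pd

df = pd.read_csv('data.csv')
print(df.iloc[-1, 0])  # last row, first column -> 3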

I found your question a bit ambiguous, so I'm answering for both cases.
If you need the first column, third row value:
import csv

value = None
with open('data.csv') as fileName:
    reader = csv.reader(fileName, delimiter=',')
    for row_number, row in enumerate(reader, 1):
        # note: the header line counts as row 1 here
        if row_number == 3:
            value = row[0]
If you need the first column, last row value:
value = None
with open('data.csv') as fileName:
    reader = csv.reader(fileName, delimiter=',')
    for row in reader:
        value = row[0]
In both cases, value has the value you want.

As mentioned in the comments, df['id'].iloc[-1] will return the last id value in the DataFrame, which in this case is what you want.
You can also access values based on the values in other columns. For example:
df.id[df.name == 'brian'] would also give you a value of 3, because brian is the name associated with an id of 3.
You also don't have to loop through the rows to get the size; once the DataFrame is loaded, you can simply do count = df.shape[0], which returns the number of rows.
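Putting those together (a minimal sketch, assuming the data.csv from the question):
import pandas as pd

df = pd.read_csv('data.csv')

print(df['id'].iloc[-1])          # last id value -> 3
print(df.id[df.name == 'brian'])  # the id where name is 'brian'
print(df.shape[0])                # number of rows -> 3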

Given that you are starting out with Python, and looking at the code provided, I think this Idiomatic Python video will be super helpful: Transforming Code Into Beautiful, Idiomatic Python | Raymond Hettinger
In addition to the pandas documentation referenced below, this summary is pretty helpful as well:
Select rows in pandas MultiIndex DataFrame.
Pandas indexing documentation:
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html

Related

Access latest entry of name X in CSV file

I have a CSV file; the columns are:
Date,Time,Spread,Result,Direction,Entry,TP,SL,Bal
And typical entries would look like:
16/07/21,01:25:05,N/A,No Id,Null,N/A,N/A,N/A,N/A
16/07/21,01:30:06,N/A,No Id,Null,N/A,N/A,N/A,N/A
16/07/21,01:35:05,8.06,Did not qualify,Long,N/A,N/A,N/A,N/A
16/07/21,01:38:20,6.61,Trade,Long,1906.03,1912.6440000000002,1900.0,1000.0
16/07/21,01:41:06,N/A,No Id,Null,N/A,N/A,N/A,N/A
How would I access the latest entry where the Result column entry is equal to Trade, preferably without looping through the whole file?
If it must be a loop, it would have to loop backwards from latest to earliest because it is a large csv file.
If you want to use pandas, try using read_csv with loc:
import pandas as pd

df = pd.read_csv('yourcsv.csv')
print(df.loc[df['Result'] == 'Trade'].iloc[[-1]])
Load your .csv into a pd.DataFrame and you can get all the rows where df.Result equals Trade like this:
df[df.Result == 'Trade']
If you only want the last one, use .iloc:
df[df.Result == 'Trade'].iloc[-1]
I hope this is what you are looking for.
I suggest you use pandas, but in case you really cannot, here's an approach.
Assuming the data is in data.csv:
from csv import reader

with open("data.csv") as data:
    rows = [row for row in reader(data)]

col = rows[0].index('Result')
res = [row for i, row in enumerate(rows) if i > 0 and row[col] == 'Trade']
The latest such entry is then res[-1]. That said, I advise against using this; it is way too brittle.
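If you do want the backwards scan the question mentions, here is a variation on the same idea (a sketch only; it still reads the whole file into memory, it just scans the rows from latest to earliest):
import csv

with open('data.csv', newline='') as f:
    rows = list(csv.reader(f))

col = rows[0].index('Result')
# scan data rows from latest to earliest, stopping at the first 'Trade'
latest_trade = next((row for row in reversed(rows[1:]) if row[col] == 'Trade'), None)
print(latest_trade)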

How do I search for a variable in a column and then print out all the rows that contain the number in a csv file with Python 3

I'm making a program that checks for a number in a CSV and then prints out the entire row. The 4th column is the number. Here is what I tried:
with open('Info.csv', 'r') as csv_file:
    if 'View C' in choice:
        read = csv.reader(csv_file)
        view = str(input("Enter Number: "))
        for column in read:
            if view == column[3]:
                print(row)
The csv file is structured like this:
John,Smith,London,131390890
Bob,Builder,Moscow,123123132
Dab,God,LA,131390890
I want the program to be like this:
Input:
123123132
Output:
Bob,Builder,Moscow,123123132,
It also needs to be able to do this:
Input:
131390890
Output:
John,Smith,London,131390890,
Dab,God,LA,131390890,
Thank you!
Btw using Python 3...
You can use a list comprehension. To find the rows that have the number you are looking for, read the entire file, appending each row to a list named content. Then:
result = [i for i in content if i[-1] == 'your_number']
This way, the variable result will contain all the rows whose last column matches your number.
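Spelled out (a minimal sketch, assuming the Info.csv sample above and that each line is parsed with the csv module):
import csv

view = input("Enter Number: ")

with open('Info.csv', newline='') as csv_file:
    content = [row for row in csv.reader(csv_file)]

result = [row for row in content if row[-1] == view]
for row in result:
    print(','.join(row))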
Use loc to subset the DataFrame based on a value in a column, assuming the name of the column is whichever_column_your_number_exists and the value you are looking for is view. Also ensure that the datatype of the column is the same as that of the value view you are searching for.
import pandas as pd

df = pd.read_csv('Info.csv')  # read your CSV file
your_df = df.loc[df['whichever_column_your_number_exists'] == view]  # assuming view is str type
Then iterate over each row of your new DataFrame your_df and print the value in each of the columns:
for row in your_df.index:
    print(df.at[row, column1_name], df.at[row, column2_name],
          df.at[row, column3_name], df.at[row, column4_name])
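Note that the sample Info.csv above has no header row, so with this approach you would pass header=None and address the number column by position instead (a sketch; the integer column labels 0..3 are what read_csv assigns in that case):
import pandas as pd

view = input("Enter Number: ")
df = pd.read_csv('Info.csv', header=None)    # integer column labels 0..3
matches = df.loc[df[3].astype(str) == view]  # column 3 holds the number
for row in matches.itertuples(index=False):
    print(','.join(str(v) for v in row))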

CSV manipulation | Searching columns | Checking rules

Help is greatly appreciated!
I have a CSV that looks like this:
[CSV example image]
I am writing a program to check that each column holds the correct data type. For example:
Column 1 - Must have valid time stamp
Column 2 - Must hold the value 2
Column 4 - Must be consecutive (If not how many packets missing)
Column 5/6 - A calculation is done on both values, and the outcome must match an inputted value
The columns can be in different positions.
I have tried using the pandas module to give each column an 'id':
import pandas as pd

fields = ['star_name', 'ra']
df = pd.read_csv('data.csv', skipinitialspace=True, usecols=fields)
print(df.keys())
print(df.star_name)
However, when doing the checks on the data it seems to get confused. What would be the next best approach to do something like this?
I have really been killing myself over this and any help would be appreciated.
Thank you!
Try using the csv module.
Example:
import csv

with open('data.csv', 'r') as f:
    # The first line of the file is assumed to contain the column names
    reader = csv.DictReader(f)
    # Read one row at a time.
    # If you need to compare with the previous row, just store that in a variable.
    prev_title_4_value = 0
    for row in reader:
        print(row['Title 1'], row['Title 3'])
        # Sample to illustrate how column 4 values can be compared
        curr_title_4_value = int(row['Title 4'])
        if (curr_title_4_value - prev_title_4_value) != 1:
            print('Values are not consecutive')
        prev_title_4_value = curr_title_4_value
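Since the question started from pandas, the same checks can also be sketched with a DataFrame. This is only a sketch: the column names Title 1 .. Title 4 are assumptions carried over from the answer above, and the column 5/6 calculation is left out because the question does not say what it is.
import pandas as pd

df = pd.read_csv('data.csv', skipinitialspace=True)

# Column 1 - must be a valid timestamp; errors='coerce' turns bad values into NaT
bad_timestamps = df[pd.to_datetime(df['Title 1'], errors='coerce').isna()]

# Column 2 - must hold the value 2
bad_values = df[df['Title 2'] != 2]

# Column 4 - must be consecutive; every gap larger than 1 means missing packets
diffs = df['Title 4'].diff().dropna()
missing_packets = int((diffs[diffs > 1] - 1).sum())

print(len(bad_timestamps), len(bad_values), missing_packets)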

how to filter rows that satisfy a regular expression via pandas

I'm trying to figure out a way to select only the rows that satisfy my regular expression via pandas. My actual dataset, data.csv, has one column (the heading is not labeled) and millions of rows. The first four rows look like:
5;4Z13H;;L
5;346;4567;;O
5;342;4563;;P
5;3LPH14;4567;;O
and I wrote the following regular expression:
([1-9][A-Z](.*?);|[A-Z][A-Z](.*?);|[A-Z][1-9](.*?);)
which would identify 4Z13H; from row 1 and 3LPH14; from row 4. Basically I would like pandas to filter my data and select rows 1 and 4.
So my desired output would be
5;4Z13H;;L
5;3LPH14;4567;;O
I would then like to save the subset of filtered rows into a new csv, filteredData.csv. So far I only have this:
import pandas as pd
import numpy as np
import sys
import re

sys.stdout = open("filteredData.csv", "w")

def Process(filename, chunksize):
    for chunk in pd.read_csv(filename, chunksize=chunksize):
        df[0] = df[0].re.compile(r"([1-9][A-Z]|[A-Z][A-Z]|[A-Z][1-9])(.*?);")
    sys.stdout.close()

if __name__ == "__main__":
    Process('data.csv', 10 ** 4)
I'm still relatively new to Python, so the code above has some syntax issues (I'm still trying to figure out how to use the pandas chunksize). However, the main issue is filtering the rows by the regular expression. I'd greatly appreciate anyone's advice.
One way is to read the csv as a pandas dataframe and then use str.contains to create a mask column:
import pandas as pd

df = pd.read_csv('data.csv', header=None)  # the single unlabeled column gets the label 0
df['mask'] = df[0].str.contains(r'\d+[A-Z]+\d+')
df = df[df['mask'] == True].drop('mask', axis=1)
You get the desired dataframe; if you wish, you can reset the index using df = df.reset_index().
                  0
0        5;4Z13H;;L
3  5;3LPH14;4567;;O
The second way is to first read the csv, write an edit file containing only the filtered rows, and then read the filtered csv back to create the dataframe:
import csv
import re
import pandas as pd

with open('data.csv', 'r') as f_in:
    with open('filteredData_edit.csv', 'w') as f_outfile:
        f_out = csv.writer(f_outfile)
        for line in f_in:
            line = line.strip()
            row = []
            if re.search(r"\d+[A-Z]+\d+", line):
                row.append(line)
                f_out.writerow(row)

df = pd.read_csv('filteredData_edit.csv', header=None)
You get
                  0
0        5;4Z13H;;L
1  5;3LPH14;4567;;O
From my experience, I would prefer the second method as it would be more efficient to filter out the undesired rows before creating the dataframe.
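Since the question also asks about chunksize, here is a hedged sketch combining the mask idea with chunked reading, so the millions of rows never have to be in memory at once (the file names follow the question; the rest is an assumption, not either answerer's code):
import pandas as pd

pattern = r'\d+[A-Z]+\d+'
first = True
for chunk in pd.read_csv('data.csv', header=None, chunksize=10 ** 4):
    filtered = chunk[chunk[0].str.contains(pattern)]
    # write the first chunk fresh, then append to the same file
    filtered.to_csv('filteredData.csv', mode='w' if first else 'a',
                    header=False, index=False)
    first = False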

Python Pandas iterrows method

I'm "pseudo" creating a .bib file by reading a csv file and then following this structure writing down every thing including newline characters. It's a tedious process but it's a raw form on converting csv to .bib in python.
I'm using Pandas to read csv and write row by row, (and since it has special characters I'm using latin1 encoder) but I'm getting a huge problem: it only reads the first row. From the official documentation I'm using their method on reading row by row, which only gives me the first row (example 1):
row = next(df.iterrows())[1]
But if I remove the next() and [1] it gives me the content of every column concentrated in one field (example 2).
Why is this happening? Why does the method in the docs not iterate through all the rows nicely? And what would the solution for example 1 look like, but for all rows?
My code:
import csv
import pandas
import bibtexparser
import codecs

colnames = ['AUTORES', 'TITULO', 'OUTROS', 'DATA', 'NOMEREVISTA', 'LOCAL', 'VOL',
            'NUM', 'PAG', 'PAG2', 'ISBN', 'ISSN', 'ISSN2', 'ERC', 'IF', 'DOI',
            'CODEN', 'WOS', 'SCOPUS', 'URL', 'CODIGO BIBLIOGRAFICO', 'INDEXAÇÕES',
            'EXTRAINFO', 'TESTE']
data = pandas.read_csv('test1.csv', names=colnames, delimiter=r";", encoding='latin1')  # , nrows=1
df = pandas.DataFrame(data=data)
with codecs.open('test1.txt', 'w', encoding='latin1') as fh:
    fh.write('#Book{Arp, ')
    fh.write('\n')
rl = data.iterrows()
for i in rl:
    ix = str(i)
    fh.write(' Title = {')
    fh.write(ix)
    fh.write('}')
    fh.write('\n')
PS: I'm new to python and programming, I know this code has flaws and it's not the most effective way to convert csv to bib.
The example row = next(df.iterrows())[1] intentionally only returns the first row.
df.iterrows() returns a generator over tuples describing the rows. The tuple's first entry contains the row index, and the second entry is a pandas Series with the row's data.
Hence, next(df.iterrows()) returns the next entry of the generator. If next has not been called before, this is the very first tuple.
Accordingly, next(df.iterrows())[1] returns the first row (i.e. the second tuple entry) as a pandas series.
What you are looking for is probably something like this:
for row_index, row in df.iterrows():
    convert_to_bib(row)
Secondly, all writing to your file handle fh must happen within the with codecs.open('test1.txt', 'w', encoding='latin1') as fh: block, because at the end of the block the file handle is closed.
For example:
with codecs.open('test1.txt', 'w', encoding='latin1') as fh:
    # iterate through all rows
    for row_index, row in df.iterrows():
        # iterate through all elements in the row
        for colname in df.columns:
            row_element = row[colname]
            fh.write('%s = {%s},\n' % (colname, str(row_element)))
Still, I am not sure whether the names of the columns exactly match the bibtex fields you have in mind; probably you have to convert these first. But I hope you get the principle behind the iterations :-)
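One way to do that conversion (a hedged sketch; the CSV-header-to-BibTeX mapping below is a hypothetical example covering only a few of the columns, and df is the DataFrame from the snippet above):
import codecs

# hypothetical mapping from the question's CSV headers to BibTeX fields
field_map = {'AUTORES': 'author', 'TITULO': 'title',
             'DATA': 'year', 'NOMEREVISTA': 'journal'}

with codecs.open('test1.txt', 'w', encoding='latin1') as fh:
    for row_index, row in df.iterrows():
        fh.write('@Article{entry%d,\n' % row_index)
        for colname, bibfield in field_map.items():
            fh.write('  %s = {%s},\n' % (bibfield, str(row[colname])))
        fh.write('}\n\n')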
