I'm very new to Python and I'm trying to get the index of an element in a list of lists. Here is my list:
data = [['0', '999.8', '1.78e-3'], ['5', '1000', '1.52e-3'], ['10', '999.7', '1.31e-3'], ['15', '999.1', '1.14e-3'], ['20', '998.2', '1.00e-3'], ['25', '997.0', '0.89e-3'], ['30', '995.7', '0.80e-3'], ['40', '992.2', '0.65e-3']]
I want to find the index of '10'. Here is my code:
for element in data:
    for e in element:
        index_valeur = e.index('10')
        print(index_valeur)
It doesn't seem to work and this is the error message:
ValueError: substring not found
How can I get the index of the value?
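The most Pythonic way is probably to use pandas; here is my rough attempt ;-) It loads the table and prints the two T values closest to the number you enter.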
import numpy as np
import pandas as pd
import io
mycsv = '''
T,rho,mu
0,999.8,1.78e-3
5,1000,1.52e-3
10,999.7,1.31e-3
15,999.1,1.14e-3
20,998.2,1.00e-3
25,997.0,0.89e-3
30,995.7,0.80e-3
40,992.2,0.65e-3
'''
myNum = float(input("Enter number: "))
df = pd.read_csv(io.StringIO(mycsv))
# Sort the T values by distance to myNum and keep the two closest
print(sorted(df['T'].values.tolist(), key=lambda x: abs(x - myNum))[:2])
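For the question as literally asked, a plain-Python sketch may be enough. The original loop fails because `e` is already a string, so `e.index('10')` performs a substring search and raises as soon as it reaches '0'. The helper name `find_position` below is made up for illustration, and the list is shortened to three rows:

```python
data = [['0', '999.8', '1.78e-3'],
        ['5', '1000', '1.52e-3'],
        ['10', '999.7', '1.31e-3']]

def find_position(rows, target):
    """Return (row index, column index) of the first match, or None."""
    for row_idx, row in enumerate(rows):
        if target in row:
            # list.index compares whole elements, not substrings
            return row_idx, row.index(target)
    return None

print(find_position(data, '10'))  # (2, 0)
```

Returning None instead of raising is only one possible choice for the "not found" case.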
I need to set the global float precision to the minimum value possible.
I also need the precision of each column, partly to derive the global precision and partly because I would like to use as many decimal places as the user wants for each column.
I get the data from a CSV. In the beginning, I load all the cells as strings. After the conversion to numbers, the columns could have different dtypes.
In the integer columns (without '.') there are no NaN values. So I thought I could keep a copy of the dataframe while it still contains strings and split each number at the '.' character. Once the cells hold floats I can no longer count the decimal places reliably, because I could get something like this: 5.55 % 1 >> 0.550000000001. Python sometimes prints only a decimal approximation of the binary value the machine actually stores, so I understand it is not possible to recover the decimal digits accurately from the floats.
There are no columns in which all the values are NaN.
import numpy as np
import pandas as pd
pd.set_option('display.precision', 15) # if > 15 the precision is not working well
df = pd.DataFrame({
    'x':['5.111112222233', '5.111112222', '5.11111222223', '5.2227', '234', '4', '5.0'],
    'y':['ÑKDFGÑKL', 'VBNVBN', 'GHJGHJ', 'GFGDF', 'SDFS', 'SDFASD', 'LKJ'],
    'z':['5.0', '5.0', '5.0', '5.0', '3', '6', '5.0'],
    'a':['5', '5', '5', '5', '3', '6', '9'],
    'b':['5.0', '5.0', '5.0', '5.0', '3.8789', '6', np.nan],
})
df_str = df.copy(deep=True)
df = df.apply(lambda t: pd.to_numeric(t, errors='ignore', downcast='integer'))
precisions = {}
pd_precision = 0
# Float columns
for c in df.select_dtypes(include=['float64']):
    p = int(df_str[c].str.rsplit(pat='.', n=1, expand=True)[1].str.len().max()) # always has one '.'
    if p > pd_precision:
        pd_precision = p
    precisions[c] = p
# Integer columns
for c in df.select_dtypes(include=['int8', 'int16', 'int32', 'int64']):
    precisions[c] = 0
# String and mixed columns
for c in df.select_dtypes(include=['object']): # or exclude=['int8', 'int16', 'int32', 'int64', 'float64']
    precisions[c] = False
if pd_precision > 15:
    pd_precision = 15
pd.set_option('display.precision', pd_precision) # pd_precision = 12
precisions # => {'x': 12, 'b': 4, 'z': 0, 'a': 0, 'y': False}
I know there is a Decimal class, but I believe I would lose all the performance benefits of a pandas dataframe with floats.
Is there a better way to get the number of decimal places?
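One stdlib option, offered as a sketch rather than a pandas-native answer: Decimal can be used only to inspect the original strings, so the dataframe itself can stay float-based. The helper name `decimal_places` is hypothetical, and the sample values are taken from column 'x' above:

```python
from decimal import Decimal, InvalidOperation

def decimal_places(text):
    """Digits after the decimal point of a numeric string, or None if it is not a finite number."""
    try:
        exponent = Decimal(text).as_tuple().exponent
    except (InvalidOperation, TypeError, ValueError):
        return None  # not parseable as a number (e.g. 'GHJGHJ' or None)
    if not isinstance(exponent, int):
        return None  # NaN and Infinity carry a non-integer exponent marker
    return max(0, -exponent)

column = ['5.111112222233', '5.111112222', '5.2227', '234', '4', '5.0']
places = [decimal_places(value) for value in column]
print(max(p for p in places if p is not None))  # 12
```

Because this reads the text before any float conversion, it avoids the binary rounding problem described in the question; whether it is fast enough on a large CSV would need measuring.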
Let me try to explain to the best of my ability, as I am not a Python wizard. I have read a PDF table of COVID-19 data for Mexico with PyPDF2 and tokenized it. (Long story: I tried tabula first, but did not get the format I was expecting, and I was going to spend more time reformatting the resulting CSV document than analyzing it.) I got back a list of strings with a length of 16792, which is fine.
Now, the problem I am facing is that I need to format it properly by concatenating some (not all) of those strings, so that I can create a list of lists in which every row has the same length, namely 9 columns.
This is an example of how it looks right now. The columns are Case number, State, Locality, Gender, Age, Date when symptoms started, Status, Type of contagion, Date of arrival to Mexico:
['1', 'PUEBLA', 'PUEBLA', 'M', '49', '15/03/2020', 'Sospechoso', 'Contacto', 'NA', '2', 'GUERRERO', 'ZONA', 'NORTE', 'M', '29', '15/03/2020', 'Sospechoso', 'Contacto', 'NA', '3', 'BAJA', 'CALIFORNIA', 'TIJUANA', 'F', '34', '14/03/2020', 'Sospechoso', 'Estados', 'Unidos', '08/03/2020', '4', 'CIUDAD', 'DE', 'MÉXICO', 'TLALPAN', 'F', '69', '25/02/2020', 'Sospechoso', 'Italia', '03/03/2020', '5', 'JALISCO', 'CENTRO', 'GUADALAJARA', 'M', '19', '18/03/2020', 'Sospechoso', 'España', '17/03/2020'
What I want is to turn certain strings such as 'ZONA', 'NORTE' into 'ZONA NORTE', 'CIUDAD', 'DE', 'MÉXICO' into 'CIUDAD DE MÉXICO', or 'Estados', 'Unidos' into 'Estados Unidos'...
I seriously do not know how to tackle this. I have tried split(), replace(), and finding the index of each occurrence; I have read all the questions about manipulating lists and tried almost all the answers provided, and I still haven't been able to do it.
Any guidance will be greatly appreciated. Sorry if this is a very basic question, but I know there has to be a way; I just don't know it.
Since the phrases are not split the same way in every row, you need a way to "recognize" them. One way: if two adjacent list items each contain a run of three or more letters, join them.
import re
row_list = ['1', 'PUEBLA', 'PUEBLA', 'M', '49', '15/03/2020', 'Sospechoso', 'Contacto', 'NA', '2', 'GUERRERO', 'ZONA', 'NORTE', 'M', '29', '15/03/2020', 'Sospechoso', 'Contacto', 'NA', '3', 'BAJA', 'CALIFORNIA', 'TIJUANA', 'F', '34', '14/03/2020', 'Sospechoso', 'Estados', 'Unidos', '08/03/2020', '4', 'CIUDAD', 'DE', 'MÉXICO', 'TLALPAN', 'F', '69', '25/02/2020', 'Sospechoso', 'Italia', '03/03/2020', '5', 'JALISCO', 'CENTRO', 'GUADALAJARA', 'M', '19', '18/03/2020', 'Sospechoso', 'España', '17/03/2020']
words_longer_than_3 = r'([^\d\W]){3,}'
def is_a_word(text):
    return bool(re.findall(words_longer_than_3, text))
def get_next_item(row_list, i):
    try:
        next_item = row_list[i+1]
    except IndexError:
        return
    return is_a_word(next_item)
for i, item in enumerate(row_list):
    item_is_a_word = is_a_word(row_list[i])
    if not item_is_a_word:
        continue
    next_item_is_a_word = get_next_item(row_list, i)
    while next_item_is_a_word:
        row_list[i] += f' {row_list[i+1]}'
        del row_list[i+1]
        next_item_is_a_word = get_next_item(row_list, i)
print(row_list)
result:
['1', 'PUEBLA PUEBLA', 'M', '49', '15/03/2020', 'Sospechoso Contacto', 'NA', '2', 'GUERRERO ZONA NORTE', 'M', '29', '15/03/2020', 'Sospechoso Contacto', 'NA', '3', 'BAJA CALIFORNIA TIJUANA', 'F', '34', '14/03/2020', 'Sospechoso Estados Unidos', '08/03/2020', '4', 'CIUDAD', 'DE', 'MÉXICO TLALPAN', 'F', '69', '25/02/2020', 'Sospechoso Italia', '03/03/2020', '5', 'JALISCO CENTRO GUADALAJARA', 'M', '19', '18/03/2020', 'Sospechoso España', '17/03/2020']
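The result above still merges neighbouring columns (for example 'PUEBLA PUEBLA') and leaves 'CIUDAD', 'DE' unjoined, so it may help to first cut the flat list into one list per case before joining anything. The following is only a sketch under two assumptions that are not stated in the question: case numbers run consecutively from 1, and every record ends with either 'NA' or a dd/mm/yyyy date. The function name `split_into_records` is made up for illustration.

```python
import re

DATE = re.compile(r'\d{2}/\d{2}/\d{4}$')

def split_into_records(tokens):
    """Group a flat token list into one list per case."""
    records = []
    expected = 1      # next case number we expect to see
    previous = None   # token seen just before the current one
    for token in tokens:
        # A new record may only start right after the end of the previous one.
        at_boundary = previous is None or previous == 'NA' or bool(DATE.match(previous))
        if token == str(expected) and at_boundary:
            records.append([token])
            expected += 1
        elif records:
            records[-1].append(token)
        previous = token
    return records

tokens = ['1', 'PUEBLA', 'PUEBLA', 'M', '49', '15/03/2020', 'Sospechoso', 'Contacto', 'NA',
          '2', 'GUERRERO', 'ZONA', 'NORTE', 'M', '29', '15/03/2020', 'Sospechoso', 'Contacto', 'NA',
          '3', 'BAJA', 'CALIFORNIA', 'TIJUANA', 'F', '34', '14/03/2020', 'Sospechoso', 'Estados', 'Unidos', '08/03/2020']
records = split_into_records(tokens)
for record in records:
    print(record)
```

Once each case is its own list, fixed landmarks inside it (the single-letter gender, the symptom date, 'Sospechoso') could be used to decide which neighbouring tokens belong to the same column.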
I suppose that the data you want to process comes from a file similar to this one here, which contains 2623 rows x 8 columns.
You can load the data from the PDF file using tabula-py. You can install it through pip install tabula-py==1.3.0. One issue with extracting the table like this is that tabula-py can mess up things sometimes. For example, the header of the table was extracted from the PDF file like this:
"",,,,Identificación Fecha de,
"",,,,de COVID-19,Fecha del llegada a
N° Caso,Estado,Sexo,Edad,Inicio de Procedencia por RT-PCR en,México
"",,,,síntomas,
"",,,,tiempo real,
Pretty nasty huh?
In addition, tabula-py could not separate some of the columns; that is, it did not write a comma in the right place, so the output CSV file would not parse correctly. For instance, the line with número de caso (case number) 8:
8,BAJA CALIFORNIA,F,65,13/03/2020 Sospechoso Estados Unidos,08/03/2020
could be fixed by just replacing " Sospechoso " with ",Sospechoso,". And you are lucky, because this is the only parsing issue you will have to deal with for now. Thus, iterating through the lines of the output CSV file and replacing " Sospechoso " with ",Sospechoso," takes care of everything.
Finally, I added an option (removeaccents) for removing the accents from the data. This can help you avoid encoding issues in the future. For this you will need unidecode: pip install unidecode.
Putting everything together, the code that reads the PDF file and converts it to a CSV file and loads it as a pandas dataframe is as follows (You can download the preprocessed CSV file here):
import tabula
from tabula import wrapper
import pandas as pd
import unidecode
"""
1. Adds the header.
2. Skips "corrupted" lines from tabula.
3. Replacing " Sospechoso " for ",Sospechoso," automatically separates
the previous column ("fecha_sintomas") from the next one ("procedencia").
4. Finally, eliminates accents to avoid encoding issues.
"""
def simplePrep(
    input_path,
    header = [
        "numero_caso",
        "estado",
        "sexo",
        "edad",
        "fecha_sintomas",
        "identification_tiempo_real",
        "procedencia",
        "fecha_llegada_mexico"
    ],
    lookfor = " Sospechoso ",
    replacewith = ",Sospechoso,",
    output_path = "preoprocessed.csv",
    skiprowsupto = 5,
    removeaccents = True
):
    fin = open(input_path, "rt")
    fout = open(output_path, "wt")
    fout.write(",".join(header) + "\n")
    count = 0
    for line in fin:
        if count > skiprowsupto - 1:
            if removeaccents:
                fout.write(unidecode.unidecode(line.replace(lookfor, replacewith)))
            else:
                fout.write(line.replace(lookfor, replacewith))
        count += 1
    fin.close()
    fout.close()
"""
Reads all the pdf pages specifying that the table spans multiple pages.
multiple_tables = True, otherwise the first row of each page will be missed.
"""
tabula.convert_into(
    input_path = "data.pdf",
    output_path = "output.csv",
    output_format = "csv",
    pages = 'all',
    multiple_tables = True
)
simplePrep("output.csv", removeaccents = True)
# reads preprocess data set
df = pd.read_csv("preoprocessed.csv", header = 0)
# prints the first 5 samples in the dataframe
print(df.head(5))
These are my stacks of arrays, both with variables arranged columnwise.
final_a = np.stack((four, five, st, dist, ru), axis=-1)
final_b = np.stack((org, own, origin, init), axis=-1)
Example:
In: final_a
Out: array([['9999', '10793', ' 1', '99', '2'],
            ['9999', '10799', ' 1', '99', '2'],
            ['9999', '10712', ' 1', '99', '2'],
            ...,
            ['9999', '23960', '33', '99', '1'],
            ['9999', '82920', '33', '99', '2'],
            ['9999', '82920', '33', '99', '2']],
           dtype='<U5')
But when I try to save either of them to a .csv file using this code:
np.savetxt("/Users/jaisaranc/Documents/ASI selected data - A.csv", final_a, delimiter=",")
It throws this error:
TypeError: Mismatch between array dtype ('<U5') and format specifier ('%.18e,%.18e,%.18e,%.18e,%.18e')
I have no idea what to do.
savetxt in NumPy lets you specify a format for how the array elements are written to the file. The default format (fmt='%.18e') can only format numeric elements. Your array contains strings (dtype='<U5' means Unicode strings of at most 5 characters), so it raises an error. In your case you should also pass fmt='%s' so that the array elements are written to the output file as strings. For example:
np.savetxt("example.csv", final_a, delimiter=",", fmt="%s")
I am having a bit of trouble getting started on an assignment. We are issued a tab-delimited .txt file with 6 columns of data and around 50 lines of this data. I need help starting a list to store this data in for later recall. Eventually I will need to be able to list all the contents of any particular column and sort it, count it, etc. Any help would be appreciated.
Edit: I really haven't done much besides research on this kind of thing. I know I'll be looking into csv, and I have done single-column .txt files before, but I'm not sure how to tackle this situation. How will I give names to the separate columns? How will I tell the program when one row ends and the next begins?
The dataframe structure in Pandas basically does exactly what you want. It's highly analogous to the data frame in R if you're familiar with that. It has built in options for subsetting, sorting, and otherwise manipulating tabular data.
It reads directly from CSV and even automatically reads in column names. You'd call:
import pandas as pd
df = pd.read_csv(yourfilename,
                 sep='\t',   # makes it tab delimited
                 header=0)   # makes the first row the header row (rows are counted from 0)
Works in Python 3.
Let's say you have a tab-separated file, test.csv, like the following.
1 2 3 4 5 6
1 2 3 4 5 6
1 2 3 4 5 6
1 2 3 4 5 6
1 2 3 4 5 6
You can read them into a dictionary like so (output shown for Python 3.8 or later):
>>> import csv
>>> reader = csv.DictReader(open('test.csv','r'), fieldnames= ['col1', 'col2', 'col3', 'col4', 'col5', 'col6'], dialect='excel-tab')
>>> for row in reader:
...     print(row)
...
{'col1': '1', 'col2': '2', 'col3': '3', 'col4': '4', 'col5': '5', 'col6': '6'}
{'col1': '1', 'col2': '2', 'col3': '3', 'col4': '4', 'col5': '5', 'col6': '6'}
{'col1': '1', 'col2': '2', 'col3': '3', 'col4': '4', 'col5': '5', 'col6': '6'}
{'col1': '1', 'col2': '2', 'col3': '3', 'col4': '4', 'col5': '5', 'col6': '6'}
{'col1': '1', 'col2': '2', 'col3': '3', 'col4': '4', 'col5': '5', 'col6': '6'}
But the pandas library might be better suited for this: http://pandas.pydata.org/pandas-docs/stable/io.html#csv-text-files
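To connect the csv module with the goals in the question (listing one column, sorting it, counting values), here is a small stdlib-only sketch. The column names and sample rows are invented for illustration, and the data is read from a string instead of the assignment file so the example is self-contained:

```python
import csv
import io
from collections import Counter

# Hypothetical tab-delimited data; the first line names the columns.
sample = "name\tcity\tage\nAna\tLima\t31\nBo\tOslo\t25\nCy\tLima\t40\n"

# DictReader takes the column names from the first row when fieldnames is omitted.
rows = list(csv.DictReader(io.StringIO(sample), delimiter="\t"))

ages = sorted(int(row["age"]) for row in rows)     # one column, sorted numerically
city_counts = Counter(row["city"] for row in rows)  # how often each value occurs

print(ages)                 # [25, 31, 40]
print(city_counts["Lima"])  # 2
```

With a real file, `io.StringIO(sample)` would be replaced by an open file object.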
Sounds like a job better suited to a database. You should just use something like PostgreSQL's COPY FROM operation to import the CSV data into a table, then use Python + SQL for all your sorting, searching and matching needs.
If you feel a real database is overkill, there are still options like SQLite and Berkeley DB, which both have Python modules.
EDIT: Berkeley DB is deprecated, but anydbm (dbm in Python 3) is similar in concept.
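As a rough idea of what the SQLite route could look like, here is a sketch using the sqlite3 module that ships with Python. The table name, column names, and sample rows are hypothetical, and an in-memory database is used so nothing is written to disk:

```python
import csv
import io
import sqlite3

# Hypothetical tab-delimited sample standing in for the assignment file.
sample = "id\tname\tscore\n2\tbob\t71\n1\talice\t93\n3\tcarol\t85\n"

reader = csv.reader(io.StringIO(sample), delimiter="\t")
header = next(reader)  # skip the header row; the columns are declared below

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER, name TEXT, score INTEGER)")
conn.executemany("INSERT INTO records VALUES (?, ?, ?)", reader)

# Sorting and counting are then plain SQL.
names_by_score = [row[0] for row in conn.execute(
    "SELECT name FROM records ORDER BY score DESC")]
count = conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]
print(names_by_score, count)
conn.close()
```

The INTEGER column type lets SQLite sort the scores numerically even though the csv module hands them over as strings.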
I think using a db for 50 lines and 6 columns is overkill, so here's my idea:
from __future__ import print_function
import os
from operator import itemgetter
def get_records_from_file(path_to_file):
    """
    Read a tab-delimited file and return a
    list of dictionaries representing the data.
    """
    records = []
    with open(path_to_file, 'r') as f:
        # Use the first line to get names for columns
        # (strip the trailing newline so the last name is clean)
        fields = [e.lower() for e in f.readline().rstrip('\n').split('\t')]
        # Iterate over the rest of the lines and store records
        for line in f:
            record = {}
            for i, field in enumerate(line.rstrip('\n').split('\t')):
                record[fields[i]] = field
            records.append(record)
    return records
if __name__ == '__main__':
    path = os.path.join(os.getcwd(), 'so.txt')
    records = get_records_from_file(path)
    print('Number of records: {0}'.format(len(records)))
    # Sort by the 'id' column (assumes the file has a column with that name)
    s = sorted(records, key=itemgetter('id'))
    print('Sorted: {0}'.format(s))
For storing records for later use, look into Python's pickle library--that'll allow you to preserve them as Python objects.
Also, note I don't have Python 3 installed on the computer I'm using right now, but I'm pretty sure this'll work on Python 2 or 3.
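A minimal sketch of the pickle suggestion, assuming the records are a list of dictionaries like the one returned above. The sample records and the file name are hypothetical, and a temporary directory is used so the example does not leave files behind in your working directory:

```python
import os
import pickle
import tempfile

# Hypothetical records in the same shape as get_records_from_file returns.
records = [{'id': '1', 'name': 'alice'}, {'id': '2', 'name': 'bob'}]

path = os.path.join(tempfile.mkdtemp(), 'records.pkl')

# Pickle files must be opened in binary mode.
with open(path, 'wb') as f:
    pickle.dump(records, f)

with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored == records)  # True
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.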