Python Programming Error for DataScience DataFrame - python

I am reading my data from a CSV file using pandas, and everything works fine when I read up to 700 rows. But as soon as I go above 700 and try to append to a list in Python, it gives me "list index out of range", even though the CSV has around 500K rows.
Can anyone help me understand why this is happening?
Thanks in advance.
import pandas as pd

df_email = pd.read_csv('emails.csv', nrows=800)
test_email = df_email.iloc[:, -1]
list_of_emails = []
for i in range(len(test_email)):
    # split one email on newlines, giving a Python list of all the lines in the email
    var_email = test_email[i].split("\n")
    email = {}
    message_body = ''
    for _ in var_email:
        if ":" in _:
            # use ":" to find the elements in the list that have ":" present
            var_sentence = _.split(":")
            for j in range(len(var_sentence)):
                if var_sentence[j].lower().strip() == "from":
                    email['from'] = var_sentence[var_sentence.index(var_sentence[j+1])].lower().strip()
                elif var_sentence[j].lower().strip() == "to":
                    email['to'] = var_sentence[var_sentence.index(var_sentence[j+1])].lower().strip()
                elif var_sentence[j].lower().strip() == 'subject':
                    if var_sentence[var_sentence.index(var_sentence[j+1])].lower().strip() == 're':
                        email['subject'] = var_sentence[var_sentence.index(var_sentence[j+2])].lower().strip()
                    else:
                        email['subject'] = var_sentence[var_sentence.index(var_sentence[j+1])].lower().strip()
        elif ":" not in _:
            message_body += _.strip()
    email['body'] = message_body
    list_of_emails.append(email)

I am not sure exactly what you are trying to do here (it would help to include example inputs and outputs), but I came across a problem a few weeks ago that might be of the same nature.
CSV files are comma-separated, so every comma in a line is treated as a column separator. If some of the strings in your CSV file contain stray commas, the columns you expect will get scrambled.
The best solution here is to clean up your CSV file, change its delimiter to another character (perhaps '|', '&', or anything else that does not appear in the data), and adjust your code to reflect that change.
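As a rough sketch of that suggestion (the file names here are made up): pandas respects quoted fields when parsing the original file, and can re-save it with a different delimiter:

import pandas as pd

# Let pandas parse the original file; quoted fields keep their internal commas.
df = pd.read_csv('emails.csv')

# Re-save with '|' as the delimiter (hypothetical output file name).
df.to_csv('emails_pipe.csv', sep='|', index=False)

# From then on, read the cleaned file with the new delimiter.
df = pd.read_csv('emails_pipe.csv', sep='|')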

Use the pandas library to read the file.
It is very efficient and saves you the time of writing the parsing code yourself.
For example:
import pandas as pd
training_data = pd.read_csv("train.csv", sep=",", header=None)

Related

Removing Unneeded Information in each Cell from Column in .csv file and Exporting -- Error: expected string or bytes-like object

Alright Python gurus, I've come for your help once again. For a work project, I want to strip the building numbers and apartment/lot/unit numbers from a list of addresses in a .csv file column. Each part of my code works individually, but not when I put it all together in a function. For background, my organization wants to see where our business comes from across our different locations; we have too many addresses to geocode with our current software, so we are settling for street level (we have internal geocoding, hence I didn't do it here). I get the error "expected string or bytes-like object" on the first line under my def statement. I understand it relates to wanting a string, but when I wrap things as str(ADDRESS) and str(RemoveLSPACE), it takes every cell in my .csv and puts it into one string, whereas I need them separate. Any ideas on this, or improvements to the rest of the code?
Previously, I used some code to remove all integers from the address, and that worked fine in the function; however, I ran into the problem of street names such as "4th St", which obviously became "th St".
import pandas as pd
import re

df = pd.read_csv('/Users/tjm4q2/Desktop/OB_Maps_Distinct_MRN.csv')
ADDRESS = df.Address

def extract_str(a):
    # Remove leading/trailing numbers and spaces
    RemoveLEADNUM = re.sub('^\d+', '', ADDRESS)
    RemoveLSPACE = RemoveLEADNUM.lstrip(' ')
    RemoveTRAILNUM = re.sub(r'\d+$', '', RemoveLSPACE)
    RemoveRSPACE = RemoveTRAILNUM.rstrip(' ')
    # Remove trailing Apt, Lot, Unit, or Unt text
    APT = " APT"
    LOT = " LOT"
    UNIT = " UNIT"
    if APT in RemoveRSPACE:
        RemoveAPTLOT = RemoveRSPACE.replace(" APT", "")
    elif LOT in RemoveRSPACE:
        RemoveAPTLOT = RemoveRSPACE.replace(" LOT", "")
    elif UNIT in RemoveRSPACE:
        RemoveAPTLOT = RemoveRSPACE.replace(" UNIT", "")
    else:
        RemoveAPTLOT = RemoveRSPACE
    # Remove leading and trailing space characters
    RemoveFINALSPACE = RemoveAPTLOT.rstrip(' ')
    return RemoveFINALSPACE

df['Street_Name'] = df['Address'].apply(lambda x: extract_str(x))
df.to_csv('Users\tjm4q2\Desktop\Output.csv')
All ideas welcome! :)
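For what it's worth, the error described above is consistent with re.sub receiving the whole ADDRESS Series rather than the single value passed into the function. Purely as a hedged sketch of one possible fix (not the asker's final code), the function could operate on its own parameter:

import re

def extract_str(a):
    # Operate on the single address passed in, not the whole ADDRESS column
    a = str(a)
    a = re.sub(r'^\d+', '', a).lstrip(' ')   # leading building number
    a = re.sub(r'\d+$', '', a).rstrip(' ')   # trailing apartment/lot/unit number
    # Drop a trailing " APT", " LOT" or " UNIT" marker, if present
    for marker in (" APT", " LOT", " UNIT"):
        if marker in a:
            a = a.replace(marker, "")
            break
    return a.rstrip(' ')

# df['Street_Name'] = df['Address'].apply(extract_str)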

Ignoring commas in string literals while reading in .csv file without using any outside libraries

I am trying to read in a .csv file that has a line that looks something like this:
"Red","Apple, Tomato".
I want to read that line into a dictionary, using "Red" as the key and "Apple, Tomato" as the definition. I also want to do this without using any libraries or modules that need to be imported.
The issue I am facing is that it is trying to split that line into 3 separate pieces because there is a comma between "Apple" and "Tomato" that the code is splitting on. This is what I have right now:
import sys

file_folder = sys.argv[1]
file_path = open(file_folder + "/food_colors.csv", "r")
food_dict = {}
for line in file_path:
    (color, description) = line.rstrip().split(',')
    print(f"{color}, {description}")
But this gives me an error because it has 3 pieces of data, but I am only giving it 2 variables to store the info in. How can I make this ignore the comma inside the string literal?
You can collect the remaining strings into a list, like so
color, *description = line.rstrip().split(',')
You can then join the description strings back together to make the value for your dict
Another way:
color, description = line.rstrip().split(',', 1)
This performs the split only once, so the rest of the string remains unsplit.
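Putting that together with the loop from the question (reusing file_path from the question's snippet), a minimal sketch of building the dictionary might look like this:

food_dict = {}
for line in file_path:
    color, description = line.rstrip().split(',', 1)
    # Strip the surrounding double quotes so key and value are clean
    food_dict[color.strip('"')] = description.strip('"')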
You can use the pandas package and its pandas.read_csv function.
For example, this works:
from io import StringIO
import pandas as pd
TESTDATA = StringIO('"Red","Apple, Tomato"')
df = pd.read_csv(TESTDATA, sep=",", header=None)
print(df)
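Since the question ultimately wants a dictionary, the two-column frame can then be converted; with header=None the column labels are the integers 0 and 1 (this line is my addition, not part of the original answer):

food_dict = dict(zip(df[0], df[1]))
print(food_dict)  # {'Red': 'Apple, Tomato'}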

Rewriting Single Words in a .txt with Python

I need to create a database, using Python and a .txt file.
Creating new items is no problem; the inside of the Database.txt looks like this:
Index Objektname Objektplace Username
i.e:
1 Pen Office Daniel
2 Saw Shed Nic
6 Shovel Shed Evelyn
4 Knife Room6 Evelyn
I get the index from a QR scanner (OpenCV) and the other information is entered via Tkinter Entry widgets. If an object is already saved in the database, you should be able to rewrite its Objektplace and Username.
My problems now are the following:
If I scan the code with index 6, how do I navigate to that entry, even if it's not in line 6, without it clashing with the "Room6" value?
How do I, for example, replace only the "Shed" of index 4 when that object is moved to, say, Room6?
The same goes for the usernames.
Up until now I've tried different methods, but nothing has worked so far.
The last try looked something like this:
def DBChange():
    # Removes unwanted bits from the scanned code
    data2 = data.replace("'", "")
    Index = data2.replace("b", "")
    # Gets the data from the Entry widgets
    User = Nutzer.get()
    Einlagerungsort = Ort.get()
    # Adds a whitespace at the end of the entries to separate them
    Userlen = len(User)
    User2 = User.ljust(Userlen)
    Einlagerungsortlen = len(Einlagerungsort) + 1
    Einlagerungsort2 = Einlagerungsort.ljust(Einlagerungsortlen)
    # Navigate to the exact line of the scanned Index and replace the words
    # for the place and the user ONLY in this line
    file = open("Datenbank.txt", "r+")
    lines = file.readlines()
    for word in lines[Index].split():
        List.append(word)
    checkWords = (List[2], List[3])
    repWords = (Einlagerungsort2, User2)
    for line in file:
        for check, rep in zip(checkWords, repWords):
            line = line.replace(check, rep)
        file.write(line)
    file.close()
    Return()
Thanks in advance
I'd suggest using pandas to read and write your text file. That way you can just use the index to select the appropriate line. And if there is no specific reason to keep your text format, I would switch to csv for ease of use.
import pandas as pd

def DBChange():
    # Removes unwanted bits from the scanned code
    # I haven't changed this part, since I guess you need this for some input data
    data2 = data.replace("'", "")
    Indexnr = data2.replace("b", "")
    # Gets the data from the Entry widgets
    User = Nutzer.get()
    Einlagerungsort = Ort.get()
    # I removed the padding lines here. This isn't necessary when using csv and pandas
    # Read in the csv file
    df = pd.read_csv("Datenbank.csv")
    # Select line with index and replace value
    df.loc[Indexnr, 'Username'] = User
    df.loc[Indexnr, 'Objektplace'] = Einlagerungsort
    # Write back to csv
    df.to_csv("Datenbank.csv")
    Return()
Since I can't reproduce your specific problem, I haven't tested it. But something like this should work.
Edit
To read and write the text file, use ' ' as the separator. (I assume none of the values contain spaces, and your text file currently uses one space between values.)
Reading:
df = pd.read_csv('Datenbank.txt', sep=' ')
Writing:
df.to_csv('Datenbank.txt', sep=' ')
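One extra note of my own (an assumption, not part of the original answer): with a plain read_csv the DataFrame gets a 0-based positional index, so looking rows up by the scanned Index value only works if that column is made the index and the scanned string is converted to an integer. A minimal sketch, reusing data, User and Einlagerungsort from the code above:

import pandas as pd

# Use the Index column from the file as the DataFrame index so .loc can match it
df = pd.read_csv("Datenbank.csv", index_col="Index")

# The scanned code arrives as a string; the Index column holds integers
Indexnr = int(data.replace("'", "").replace("b", ""))

df.loc[Indexnr, "Username"] = User
df.loc[Indexnr, "Objektplace"] = Einlagerungsort
df.to_csv("Datenbank.csv")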
First of all, this is a terrible way to store data. My suggestion below is not particularly well-written code either, so don't do this in production!
newlines = []
for line in lines:
    entry = line.split()
    if entry[0] == Index:
        # line now is the correct line
        # index 2 is the place, index 0 the ID, etc.
        entry[2] = Einlagerungsort2
    newlines.append(" ".join(entry))
# Now write newlines back to the file
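A minimal sketch of that last step (the file name is taken from the question; simply rewriting the whole file):

with open("Datenbank.txt", "w") as f:
    f.write("\n".join(newlines) + "\n")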

About dict.fromkeys, key from filename, values inside file, using Regex

Well, I'm learning Python, so I'm working on a project that consists of transferring some numbers from PDF files to xlsx, placing them in their corresponding columns, with the rows determined by the row heading.
The idea I came up with is to convert the PDF files to txt and build a dictionary from the txt files, whose keys are a part of each file name (because it contains part of the row header) and whose values are the numbers I need.
I have already managed to convert the files to txt; now I am working on the script that builds the dictionary. At the moment it looks like this:
import os
import re

p = re.compile(r'\w+\f+')
'''
I'm not entirely sure yet how re.compile works, but I know I'm missing
something to indicate that what I want is immediately to the right. I'm also
not sure whether the keywords will be ignored; I just want to take out the numbers.
'''
m = p.match('Theese are the keywords' or 'That are immediately to the left' or 'The numbers I want')

def IsinDict(txtDir):
    ToData = ()
    if txtDir == "":
        txtDir = os.getcwd() + "\\"
    for txt in os.listdir(txtDir):
        ToKey = txt[9:21]
        if ToKey == (r"\w+"):
            Data = open(txt, "r")
            for string in Data:
                ToData += m.group()
            Diccionary = dict.fromkeys(ToKey, ToData)
    return Diccionary

txtDir = "Absolute/Path/OfTheText/Files"
IsinDict(txtDir)
Any contribution is welcome, thanks for your attention.
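Not a full answer, but since the question says re.compile is still unclear: a compiled pattern is just a reusable regex object, and a capture group can pull out the number immediately to the right of a keyword. A small illustrative sketch with a made-up keyword and sample text:

import re

# Compile once, reuse many times; the parentheses form a capture group that
# grabs the number appearing immediately to the right of the keyword.
pattern = re.compile(r'Total\s*:?\s*(\d+(?:[.,]\d+)?)')

sample = "Total 1.234"
match = pattern.search(sample)
if match:
    print(match.group(1))   # -> 1.234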

Huge timing difference in string concatenation between Python 3.5 vs 2.7

I have a file with 340,000 lines of data. When I read the file with Python 3.5 the timing is good, but when I run it with Python 2.7 reading is very slow. I have no clue what is going on here; here is the code:
import codecs as cds

INPUT_DATA_DIR = 'some_tsv_file.tsv'
ENT = "entities"

def train_data_getter(input_dir=INPUT_DATA_DIR):
    file_h = cds.open(input_dir, encoding='utf-8')
    data = file_h.read()
    file_h.close()
    sentences = data.split("\n\n")
    parsed_data = parser(sentences[0])
    return parsed_data

def parser(raw_data):
    words = [line for line in raw_data.split("\n")]
    temp_l = []
    temp_s = ""
    for word in words:
        token, ent = word.split('\t')
        temp_s += token
        temp_s += " "
        temp_l.append(ent)
    data = [(temp_s), {ENT: temp_l}]
    return data
Edit
Thanks to @PM 2Ring, the problem was string concatenation inside a for loop, but the reason for the huge difference between Python 2.7 and 3.5 is still not clear to me.
Iteratively doing 340,000 appends inside a loop is woefully inefficient, so just don't do it.
Anyway, pandas supports reading TSV files; it is more performant and supports a chunksize argument for fast reads of large csv/tsv files:
import pandas as pd
train = pd.read_table('some_tsv_file.tsv', delim_whitespace=True, chunksize=100000)
# you probably need encoding='utf-8'. You may also need to tweak the settings for header, skiprows etc. Read the doc.
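For completeness, on the string-concatenation point from the question's edit: collecting the pieces in a list and joining once at the end avoids the repeated += on a string. A minimal sketch of the parser rewritten that way (same tab-separated input format assumed):

ENT = "entities"

def parser(raw_data):
    tokens = []
    ents = []
    for word in raw_data.split("\n"):
        token, ent = word.split('\t')
        tokens.append(token)
        ents.append(ent)
    # Join once at the end instead of concatenating in the loop;
    # the trailing space mirrors the original temp_s behaviour.
    return [" ".join(tokens) + " ", {ENT: ents}]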
