I'm a little new to Python, currently working through the 6.00x (spring 2013) course.
I'd hoped to try out some of my newfound knowledge but appear to have overreached.
The idea was to import a CSV file containing my bank statement into Python. I'd then hoped to turn each transaction into an instance of a class, and to start playing around with the data to see what I could do. But I appear to be failing at even the first hurdle: getting things nicely fitted into my object-oriented program.
I started with this to import my file:
import csv

datafile = open('PATH/TO/file.csv', 'r')
datareader = csv.reader(datafile)
data = []
for row in datareader:
    data.append(row)
That seems to work. I get a list of all the statement data that looks something like this (you'll understand me not uploading the real data...):
[['date', 'type', 'details', 'amount', 'balance', 'accountname', 'accountdetails', 'blank_string'],['date', 'type', 'details', 'amount', 'balance', 'accountname', 'accountdetails', 'blank_string'],['date', 'type', 'details', 'amount', 'balance', 'accountname', 'accountdetails', 'blank_string'],['date', 'type', 'details', 'amount', 'balance', 'accountname', 'accountdetails', 'blank_string']]
so typing data[0] would get me:
['date', 'type', 'details', 'amount', 'balance', 'accountname', 'accountdetails', 'blank_string']
So then I created my class and constructor, with the idea of decomposing each of these transactions into an easily accessible item.
class Transaction(object):
    """
    Abstract class for building different kinds of transaction
    """
    def __init__(self, data):
        self.date = data[0]
        self.trans_type = data[1]
        self.description = data[2]
        self.amount = data[3]
        self.balance = data[4]
        self.account_type = data[5]
        self.account_details = data[6]
I find this works if I now enter:
T1 = Transaction(data[0])
However, I don't want to have to constantly type T1 = ..., T2 = ..., T3 = ..., T4 = ...; there are LOADS of transactions and it'd take forever!
So I tried this!
for i in range(len(data)):
    eval("T" + str(i)) = Transaction(data[i])
But Python really doesn't like that... It reports back:
SyntaxError: There is an error in your program: * can't assign to function call (FILENAME.py, line 80)
So my question is: why can't I iteratively use the eval() function to assign my data to instances of the Transaction class?
If there's no way around that, how else might I go about doing so?
I also have a lingering doubt that my mistake suggests I've missed some point about object-oriented programming and when it's appropriate to use it. Would I be better off just feeding my CSV data into a dictionary as it's imported and playing around with it from there?
Many thanks!
Huw
Use a transactions = [] list instead, and simply .append() new Transaction instances:
transactions = []
for row in datareader:
    transactions.append(Transaction(row))
or even:
transactions = [Transaction(row) for row in datareader]
There is no need to create individual variables for each row result.
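Once the transactions are in a list you can index, slice, and loop over them like any other sequence. A small sketch, with made-up rows standing in for the real CSV data (and the Transaction class repeated so it runs standalone):

```python
# The Transaction class from the question, repeated so this sketch runs on its own
class Transaction(object):
    def __init__(self, data):
        self.date = data[0]
        self.trans_type = data[1]
        self.description = data[2]
        self.amount = data[3]
        self.balance = data[4]
        self.account_type = data[5]
        self.account_details = data[6]

# Made-up rows standing in for the real statement data
rows = [
    ['01/04/2013', 'DEB', 'Coffee shop', '-2.50', '97.50', 'Current', '12-34-56', ''],
    ['02/04/2013', 'BAC', 'Salary', '1500.00', '1597.50', 'Current', '12-34-56', ''],
]

transactions = [Transaction(row) for row in rows]

# Index by position instead of naming T1, T2, T3, ...
first = transactions[0]

# Loop over all of them, e.g. to total the amounts
total = sum(float(t.amount) for t in transactions)
```

Here transactions[0] plays the role of T1, transactions[1] of T2, and so on, with no limit on how many rows the CSV has.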
I am having trouble with the following task.
I need to create a class that accommodates the student id, name, and grades of all students. My idea was to create an empty DataFrame and append the values I add through the class.
I came up with the below.
class Student:
    students_archive = pd.DataFrame(index = 'student_id', columns = ['student_id', 'name', 'grade'])

    def __init__(self, s_id, name, grade):
        self.s_id = s_id
        self.name = name
        self.grade = grade
        st = {'student_id': self.s_id, 'name': self.name, 'grade': self.grade}
        pd.concat([Student.students_archive, st])
I am, however, getting the following error:
If using all scalar values, you must pass an index
I don't really understand what's wrong and I have looked all around; can anybody help me? Thanks.
I also can't help but think mine is the wrong approach, since the task doesn't actually specify that it needs to be a DataFrame; it just says that I have to 'create a class that accommodates students' name, grade, and id, and create methods to add, remove or upgrade the students' values'. Perhaps I can do all of that without creating a DataFrame?
Thank you
I can't comment yet, so here's an answer.
Running the code shows me two errors:
This: index = 'student_id' should be more like index = ['student_id'].
This: pd.concat([Student.students_archive, st]) does not accept a dictionary, but rather something like pd.concat([dataframe_0, dataframe_1]).
Also, I think it's better if you put your DataFrame inside __init__().
So it should result in something like:
import pandas as pd

class Student:
    def __init__(self, s_id, name, grade):
        self.s_id = s_id
        self.name = name
        self.grade = grade
        self.students_archive = pd.DataFrame(index = ['student_id'], columns = ['student_id', 'name', 'grade'])
        temp_df = pd.DataFrame.from_dict({'student_id': [self.s_id], 'name': [self.name], 'grade': [self.grade]})
        new_temp_df = pd.concat([self.students_archive, temp_df])
But keep in mind that temp_df and new_temp_df will probably be garbage collected, so keep what you want by adding self.
As I understand it:
template = pd.DataFrame(index = ['student_id'], columns = ['student_id', 'name', 'grade'])
entries = pd.DataFrame.from_dict({'student_id': [self.s_id],'name': [self.name], 'grade': [self.grade]})
self.student_df = pd.concat([template, entries])
UPDATE:
Having read your comment and the code you linked, it seems like instantiating each student is out of scope. You seem to be on the right track, though.
I would still put the main DataFrame inside __init__(); it keeps things tidier, imho. You can still access it from outside the class (more below).
Incorporate the functionality needed through methods (add/remove/update), which you have started doing.
Whether this approach is correct or not is probably up to your professor and whether they forbid libraries like pandas. I don't see anything wrong with it, as it gives you much of what's needed.
So, in code, I would suggest something like this:
import pandas as pd

class Students:
    def __init__(self):
        # keep in mind that pandas will add another column to the left for the row id (0, 1, 2, etc.)
        self.df = pd.DataFrame(columns = ['student_id', 'name', 'grade'])

    def add_student(self, s_id, nm, gr):
        # pd.concat returns a new dataframe containing the new row; we replace our old one with it
        # (DataFrame.append was removed in pandas 2.0, so concat is used here instead)
        new_row = pd.DataFrame([{'student_id': s_id, 'name': nm, 'grade': gr}])
        self.df = pd.concat([self.df, new_row], ignore_index = True)

    def remove_student(self, s_id):
        # we are basically getting a new dataframe without the rows that have s_id as their student_id
        self.df = self.df[self.df.student_id != s_id]

    # this could be broken down into 3 methods, 1 for each (id, name, grade)
    def update_student(self, s_id, nm, gr):
        # as I don't know how to update a row in place, I leave this as remove-then-add:
        self.remove_student(s_id)
        self.add_student(s_id, nm, gr)
Accessing the dataframe from outside is as easy as:
S = Students()
print(S.df)
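If pandas ends up being off-limits for the assignment, the same add/remove/update interface can be sketched without it, using a plain dict keyed by student id; the method names below simply mirror the ones above:

```python
class Students:
    def __init__(self):
        # student_id -> {'name': ..., 'grade': ...}
        self.records = {}

    def add_student(self, s_id, name, grade):
        self.records[s_id] = {'name': name, 'grade': grade}

    def remove_student(self, s_id):
        # pop with a default so removing a missing id is a no-op
        self.records.pop(s_id, None)

    def update_student(self, s_id, name=None, grade=None):
        # update only the fields that were passed in
        student = self.records[s_id]
        if name is not None:
            student['name'] = name
        if grade is not None:
            student['grade'] = grade

s = Students()
s.add_student(1, 'Ada', 90)
s.update_student(1, grade=95)
```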
I am having a problem with my code and getting it to work. I'm not sure if I'm sorting this correctly. I am trying to sort without lambda, pandas, or itemgetter.
Here is the code that I am having issues with.
with open('ManufacturerList.csv', 'r') as man_list:
    ml = csv.reader(man_list, delimiter=',')
    for row in ml:
        manufacturerList.append(row)
        print(row)

with open('PriceList.csv', 'r') as price_list:
    pl = csv.reader(price_list, delimiter=',')
    for row in pl:
        priceList.append(row)
        print(row)

with open('ManufacturerList.csv', 'r') as service_list:
    sl = csv.reader(service_list, delimiter=',')
    for row in sl:
        serviceList.append(row)
        print(row)

new_mfl = (sorted(manufacturerList, key='None'))
new_prl = (sorted(priceList, key='None'))
new_sdl = (sorted(serviceList, key='None'))

for x in range(0, len(new_mfl)):
    new_mfl[x].append(priceList[x][1])
for x in range(0, len(new_mfl)):
    new_mfl[x].append(serviceList[x][1])

new_list = new_mfl
inventoryList = (sorted(list, key=1))
I have tried to use a def function to get it to work, but I don't know if I'm doing it right. This is what I tried:
def new_mfl(x):
    return x[0]

x.sort(key=new_mfl)
You can do it like this:
def manufacturer_key(x):
    return x[0]

sorted_mfl = sorted(manufacturerList, key=manufacturer_key)
The key argument is the function that extracts the field of the CSV that you want to sort by.
The same thing can be written inline with a lambda:
sorted_mfl = sorted(manufacturerList, key=lambda x: x[0])
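A key function can also convert a field before comparing, so that, for example, prices sort numerically instead of as strings. A small sketch with made-up rows (not your real CSV data):

```python
# Made-up rows: [manufacturer, price]
rows = [
    ['Acme', '12.50'],
    ['Widgets Inc', '3.99'],
    ['Gadget Co', '101.00'],
]

# Sort by manufacturer name (column 0)
def manufacturer_key(x):
    return x[0]

by_name = sorted(rows, key=manufacturer_key)

# Sort by price (column 1), converting to float so '101.00' doesn't sort before '3.99'
def price_key(x):
    return float(x[1])

by_price = sorted(rows, key=price_key)
```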
There are different Dialects and Formatting Parameters that control how data is read from and written to comma-separated-value files. It may be possible to handle this with fewer statements by using the correct delimiter, which depends on the type of data you have, combined with built-in methods such as split for string data, or other list methods to sort and manipulate the values. For example, for single-column data, delimiter=',' separates the data by commas, and csv.reader iterates through each value rather than through a list of lists:
['9.310788653967691', '4.065746465800029', '6.6363356879192965', '7.279020237137884', '4.010297786910394']
['9.896092029283933', '7.553018448286675', '0.3268282119829197', '2.348011394854333', '3.964531054345021']
['5.078622663277619', '4.542467725728741', '3.743648062104161', '12.761916277286993', '9.164698479088221']
# out:
column1 column2 column3 column4 column5
0 4.737897984379577 6.078414943611958 2.7021438955897095 5.8736388919905895 7.878958949784588
1 4.436982168483749 3.9453563399358544 12.66647791861843 5.323017508568736 4.156777982870004
2 4.798241413768279 12.690268531982028 9.638858110105895 7.881360524434767 4.2948334000783195
This works because I am using lists that contain singular values. For columns or lists of the form sorted_mfl = {'First Name': ['name', 'name', 'name'], 'Second Name': [...], 'ID': [...]}, new_prl = ['example', 'example', 'example'], new_sdl = [...], the data would be combined with something like sorted_mfl + new_prl + new_sdl. Since different modules can also be used to manage comma-separated files, you should add more information to your question, such as the data types you use, or create a minimal reproducible example with pandas.
I'm not quite sure how to formulate the question, as I'm quite new to Python and coding in general.
I have a GUI that displays already available information from a CSV:
def updatetext(self):
    """adds information extracted from database already provided"""
    df_subj = Content.extract_saved_data(self.date)
    self.lineEditFirstDiagnosed.setText(str(df_subj["First_Diagnosed_preop"][0])) \
        if str(df_subj["First_Diagnosed_preop"][0]) != 'nan' else self.lineEditFirstDiagnosed.setText('')
    self.lineEditAdmNeurIndCheck.setText(str(df_subj['Admission_preop'][0])) \
This works great.
Now, if I change values in the GUI, I want them to be updated in the CSV.
I started like this:
def onClickedSaveReturn(self):
    """closes GUI and returns to calling (main) GUI"""
    df_general = Clean.get_GeneralData()
    df_subj = {k: '' for k in Content.extract_saved_data(self.date).keys()}  # extract empty dictionary
    df_subj['ID'] = General.read_current_subj().id[0]
    df_subj['PID'] = df_general['PID_ORBIS'][0]
    df_subj['Gender'] = df_general['Gender'][0]
    df_subj['Diagnosis_preop'] = df_general['diagnosis'][0]
    df_subj["First_Diagnosed_preop"] = self.lineEditFirstDiagnosed.text()
    df_subj['Admission_preop'] = self.lineEditAdmNeurIndCheck.text()
    df_subj['Dismissal_preop'] = self.DismNeurIndCheckLabel.text()
and this is what my boss added now:
subj_id = General.read_current_subj().id[0]  # reads data from current_subj (saved in ./tmp)
df = General.import_dataframe('{}.csv'.format(self.date), separator_csv=',')
if df.shape[1] == 1:
    df = General.import_dataframe('{}.csv'.format(self.date), separator_csv=';')
idx2replace = df.index[df['ID'] == subj_id][0]
# TODO: you need to find a way to turn the dictionary df_subj into a dataframe and replace the data at
# the index idx2replace of 'df' with df_subj. Later I would suggest using line 322 to save everything to the
# file
df.iloc[idx2replace] = pds.DataFrame([df_subj])
df.to_csv("preoperative.csv", index=False)
# df.to_csv(os.path.join(FILEDIR, "preoperative.csv"), index=False)
self.close()
I'm not really sure how to approach this or, to be honest, what to do at all.
Hope someone can help me.
Thank you
You should load the file only once and keep the DataFrame (self.df or something). Then display it, and every time the user changes a value in the GUI, the DataFrame should update; when the user clicks save, you should just overwrite the existing file with the current DataFrame in memory.
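A minimal sketch of that pattern (the class and method names here are hypothetical, not from the actual GUI code):

```python
import pandas as pd

class SubjectData:
    """Keeps one DataFrame in memory; the GUI edits it and saves on demand."""

    def __init__(self, path):
        self.path = path
        self.df = pd.read_csv(path)  # load the file only once

    def update_value(self, subj_id, column, value):
        # call this whenever the user changes a field in the GUI
        self.df.loc[self.df['ID'] == subj_id, column] = value

    def save(self):
        # call this from the save/return button before closing:
        # overwrite the existing file with the current in-memory DataFrame
        self.df.to_csv(self.path, index=False)
```

Each line edit's editingFinished signal could then call update_value, and onClickedSaveReturn would just call save() before self.close().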
I have some trouble reading a dataset like this:
# title
# description
# link (could be not still active)
# id
# date
# source (nyt|us|reuters)
# category
example:
court agrees to expedite n.f.l.'s appeal\n
the decision means a ruling could be made nearly two months before the regular season begins, time for the sides to work out a deal without delaying the
season.\n
http://feeds1.nytimes.com/~r/nyt/rss/sports/~3/nbjo7ygxwpc/04nfl.html\n
0\n
04 May 2011 07:39:03\n
nyt\n
sport\n
I tried:
columns = ['title', 'description', 'link', 'id', 'date', 'source', 'category']
df = pd.read_csv('news', delimiter="\n", names=columns, error_bad_lines=False)
But it puts all the information into the title column.
Does someone know a way to deal with this?
Thanks!
You can't use \n as a delimiter for csv. What you could do is read the file as a single unnamed column, set the index equal to the column names, and then transpose, i.e.:
df = pd.read_csv('news', header=None)
df.index = columns
df = df.transpose()
Note this only works if the file holds exactly one record, with one field per line.
Here are a few things to note:
1) Any delimiter more than 1 character long is interpreted by Pandas as a regular expression.
2) Since the 'c' engine does not support regex, I have explicitly defined the engine as 'python' to avoid warnings.
3) I had to add a dummy column because there is a '\n' at the end of the file; I later removed that column using drop.
So, these lines will hopefully produce what you want.
columns = ['title', 'description', 'link', 'id', 'date', 'source', 'category', 'dummy']
df = pd.read_csv('news', names=columns, delimiter="\\\\n", engine='python').drop('dummy', axis=1)
df
I hope this helps :)
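If the regex delimiter feels fragile, another option (a sketch under the assumption that every record really does consist of the same 7 fields, each terminated by a literal two-character '\n' marker) is to skip read_csv's line-based parsing entirely: read the whole file, split on the marker, and group every 7 fields into one record. This also copes with fields that wrap across physical lines, like the description in the example:

```python
import pandas as pd

columns = ['title', 'description', 'link', 'id', 'date', 'source', 'category']

def parse_news(path):
    with open(path) as f:
        text = f.read()
    # fields end with a literal backslash-n marker, so split on that
    # instead of on physical line breaks (a field may wrap across lines)
    fields = [fld.strip().replace('\n', ' ') for fld in text.split('\\n') if fld.strip()]
    # group every 7 consecutive fields into one record
    records = [fields[i:i + len(columns)] for i in range(0, len(fields), len(columns))]
    return pd.DataFrame(records, columns=columns)

# df = parse_news('news')
```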
I'm new to Spark and I'm trying to develop a Python script that reads a CSV file with some logs:
userId,timestamp,ip,event
13,2016-12-29 16:53:44,86.20.90.121,login
43,2016-12-29 16:53:44,106.9.38.79,login
66,2016-12-29 16:53:44,204.102.78.108,logoff
101,2016-12-29 16:53:44,14.139.102.226,login
91,2016-12-29 16:53:44,23.195.2.174,logoff
And it checks whether a user has shown some strange behavior, for example doing two consecutive 'login' events without a 'logoff' in between. I've loaded the CSV as a Spark DataFrame, and I want to compare the log rows of a single user, ordered by timestamp, checking whether two consecutive events are of the same type (login-login, logoff-logoff). I'm looking to do it in a 'map-reduce' way, but at the moment I can't figure out how to use a reduce function that compares consecutive rows.
The code I've written works, but the performance is very bad.
sc = SparkContext("local", "Data Check")
sqlContext = SQLContext(sc)
LOG_FILE_PATH = "hdfs://quickstart.cloudera:8020/user/cloudera/flume/events/*"
RESULTS_FILE_PATH = "hdfs://quickstart.cloudera:8020/user/cloudera/spark/script_results/prova/bad_users.csv"
N_USERS = 10*1000

dataFrame = sqlContext.read.format("com.databricks.spark.csv").load(LOG_FILE_PATH)
dataFrame = dataFrame.selectExpr("C0 as userID", "C1 as timestamp", "C2 as ip", "C3 as event")

wrongUsers = []
for i in range(0, N_USERS):
    userDataFrame = dataFrame.where(dataFrame['userId'] == i)
    userDataFrame = userDataFrame.sort('timestamp')
    prevEvent = ''
    for row in userDataFrame.rdd.collect():
        currEvent = row[3]
        if prevEvent == currEvent:
            wrongUsers.append(row[0])
        prevEvent = currEvent

badUsers = sqlContext.createDataFrame(wrongUsers)
badUsers.write.format("com.databricks.spark.csv").save(RESULTS_FILE_PATH)
First (not related, but still): be sure that the number of entries per user is not too big, because the collect in for row in userDataFrame.rdd.collect(): is dangerous.
Second, you don't need to leave the DataFrame world here for classical Python; just stick to Spark.
Now, your problem. It's basically "for each line I want to know something from the previous line": that belongs to the concept of Window functions, and to be precise the lag function. Here are two interesting articles about Window functions in Spark: one from Databricks with code in Python, and one from Xinh with (I think easier to understand) examples in Scala.
I have a solution in Scala, but I think you'll manage to translate it into Python:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag
import sqlContext.implicits._

val LOG_FILE_PATH = "hdfs://quickstart.cloudera:8020/user/cloudera/flume/events/*"
val RESULTS_FILE_PATH = "hdfs://quickstart.cloudera:8020/user/cloudera/spark/script_results/prova/bad_users.csv"

val data = sqlContext
  .read
  .format("com.databricks.spark.csv")
  .option("inferSchema", "true")
  .option("header", "true") // use the header from your csv
  .load(LOG_FILE_PATH)

val wSpec = Window.partitionBy("userId").orderBy("timestamp")

val badUsers = data
  .withColumn("previousEvent", lag($"event", 1).over(wSpec))
  .filter($"previousEvent" === $"event")
  .select("userId")
  .distinct

badUsers.write.format("com.databricks.spark.csv").save(RESULTS_FILE_PATH)
Basically you just retrieve the value from the previous line and compare it to the value on your current line; if it's a match, that is a wrong behavior and you keep the userId. For the first line in each userId "block" of lines, the previous value will be null: when comparing with the current value, the boolean expression will be false, so there is no problem there.
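To help with the translation, the partitionBy/orderBy/lag combination can be mimicked in plain Python, which shows what the window is doing (made-up log rows, no Spark involved):

```python
from collections import defaultdict

# (userId, timestamp, event) tuples, made up in the same shape as the log
logs = [
    (13, '2016-12-29 16:53:44', 'login'),
    (13, '2016-12-29 16:55:02', 'login'),   # second login, no logoff in between
    (43, '2016-12-29 16:53:44', 'login'),
    (43, '2016-12-29 16:56:10', 'logoff'),
]

# partitionBy("userId"): group the rows per user
by_user = defaultdict(list)
for user, ts, event in logs:
    by_user[user].append((ts, event))

bad_users = set()
for user, rows in by_user.items():
    # orderBy("timestamp"): walk each user's rows in time order
    rows.sort()
    # lag("event", 1): pair each row with the previous one and compare
    for (_, prev_event), (_, curr_event) in zip(rows, rows[1:]):
        if prev_event == curr_event:
            bad_users.add(user)
```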