Create new proper time-dataset from raw data with Python

I'm sorry if this question has already been asked, but I believe my challenge is specific enough. I'm not looking for complete answers, simply guidelines on how I can proceed.
I have a raw dataset of monitoring participants. This data includes things like income, savings, etc., and these participants have been tracked for 6 months (Jan to Jun). But the data is stored in one single Excel file with a column to specify the month, which means that one participant's name comes back 6 times in the file, once for each month. Each participant has a unique ID.
I want to transform this data into a more workable shape, and I wanted to learn to do it with Python. But I feel stuck and rusty, because it's been ages since I've coded and I'm only used to the code I run on a regular basis (printing grouped averages, etc.). Here are the steps I want to follow:
a. Start by creating a column which contains a unique list of the participants that have been tracked, using the ID. Each participant should appear only once;
b. Each participant is recorded with an activity and sub-activity type in the original file, which will need to be added to the new dataset as well;
c. For the month of January, for example, I want to create a 'january_income' column into which the January income is pulled from the raw dataset, and so on for each variable and each month.
Can anyone provide guidelines on how I may proceed? As I said, it doesn't have to be specific code; it can be methods or steps along with the functions I can use.
Thanks a lot already.
N.B: I use Spyder as a working environment.

Your question is not very specific, but you can try and adapt the code below:
import csv

"""
Convert your Excel file to CSV format first.
This sample assumes that you have a CSV file with the first row as the header (fieldnames).
"""
with open('test.csv', 'w') as fp:
    fp.write("""ID,Name,Income,Savings,Month
1,"Sample Name",1000,100,1
""")

def format(infile='infile.csv', outfile='outfile.csv'):
    months = ['January', 'February', 'March']  # Add the remaining months
    target_fields = ['Income', 'Savings']  # Add your desired fields
    timestamp_field = 'Month'  # The field which indicates the month of the row
    ID_field = 'ID'  # The field which indicates the unique identifier of the participant
    part_specific_fields = [ID_field, 'Name']  # Fields specific to each participant; these won't be touched at all
    target_combined_fields = [f'{month}_{field}' for field in target_fields for month in months]
    total_fields = part_specific_fields + target_combined_fields
    temp = {}
    with open(infile, 'r') as fpi, open(outfile, 'w', newline='') as fpo:
        reader = csv.DictReader(fpi)
        for row in reader:
            ID = int(row[ID_field])
            if ID not in temp:
                temp[ID] = {}
                for other_field in part_specific_fields:
                    # Insert the constant columns that should not be touched
                    temp[ID][other_field] = row[other_field]
            month_pos = int(row[timestamp_field]) - 1  # subtract 1 for 0 indexing
            month = months[month_pos]  # Month name in plain English
            for field in target_fields:
                temp[ID][f'{month}_{field}'] = row[field]
        # All the processing completed, now write the data
        writer = csv.DictWriter(fpo, fieldnames=total_fields)
        writer.writeheader()
        for row in temp.values():
            writer.writerow(row)
    # File has been written successfully; now return the mapped dictionary
    return temp

print(format('test.csv'))
First, you have to convert your .xls file to .csv format.
Then process each row and map it to the matching <month>_<field> key.
Finally, write the processed data to the outfile.csv file.
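If you'd rather do the conversion in Python too, pandas can read the workbook directly; a minimal sketch (assuming an Excel engine such as openpyxl is installed, and 'data.xlsx' stands in for your real file):
import pandas as pd

# Read the first sheet of the workbook ('data.xlsx' is a placeholder name)
raw = pd.read_excel('data.xlsx', sheet_name=0)
# Write it out as CSV for the script above
raw.to_csv('test.csv', index=False)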

Thanks for the notes. First of all, I'm sorry if my post is not specific, and thanks for welcoming me to the community. Since my initial post, I've made some effort to work on my data, and with my current knowledge of the language, all I could come up with was the filtering code shown below. This gives me a column for each variable of each month, but I'm stuck on two things. First, I had to repeat this code for each month and change the month in the labels. I wouldn't have minded that approach if I didn't face a second problem: it doesn't take into account the fact that some participants were not tracked in certain months, which means that even if the data is sorted by ID, the columns are mismatched because their lengths vary with the number of participants tracked that month. Now I'm looking to optimize this code by adding something that resolves the second issue (at this point I don't mind if the code is long, but if there are optimizations to be made, I'm open to them as well):
import os
import pandas as pd

os.chdir("XXXXXXX")
economique = pd.read_csv('data_economique.csv')
#JANVIER
ID_jan = economique.query("mois_de_suivi == 'Janvier'")["ID"]
nom_jan = economique.query("mois_de_suivi == 'Janvier'")["nom"]
sexe_jan = economique.query("mois_de_suivi == 'Janvier'")["sexe"]
district_jan = economique.query("mois_de_suivi == 'Janvier'")["district"]
activite_jan = economique.query("mois_de_suivi == 'Janvier'")["activite"]
CA_jan = economique.query("mois_de_suivi == 'Janvier'")["chiffre_affaire"]
charges_jan = economique.query("mois_de_suivi == 'Janvier'")["charges"]
resultat_jan = economique.query("mois_de_suivi == 'Janvier'")["benefice"]
remb_attendu_jan = economique.query("mois_de_suivi == 'Janvier'")["remb_attendu"]
remb_effectue_jan = economique.query("mois_de_suivi == 'Janvier'")["remb_effectue"]
remb_differe_jan = economique.query("mois_de_suivi == 'Janvier'")["calcul_remb_differe"]
epargne_jan = economique.query("mois_de_suivi == 'Janvier'")["calcul_epargne"]
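One way to avoid both the per-month repetition and the length mismatch is to pivot the long table into one row per participant: a participant with no record for a month then simply gets NaN instead of shifting the columns. A sketch, assuming the column names from the snippet above:
import pandas as pd

economique = pd.read_csv('data_economique.csv')

# Columns that describe the participant and never change from month to month
fixed = ['ID', 'nom', 'sexe', 'district', 'activite']
# Columns that vary by month
monthly = ['chiffre_affaire', 'charges', 'benefice', 'remb_attendu',
           'remb_effectue', 'calcul_remb_differe', 'calcul_epargne']

# One row per participant with the fixed columns
participants = economique[fixed].drop_duplicates('ID')

# One column per (variable, month); missing months become NaN
wide = economique.pivot(index='ID', columns='mois_de_suivi', values=monthly)
wide.columns = [f'{field}_{mois}' for field, mois in wide.columns]
wide = wide.reset_index()

final = participants.merge(wide, on='ID', how='left')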

Related

Python: iterate through the rows of a csv and calculate date difference if there is a change in a column

Only basic knowledge of Python, so I'm not even sure if this is possible?
I have a csv that looks like this:
[1]: https://i.stack.imgur.com/8clYM.png
(This is dummy data, the real one is about 30K rows.)
I need to find the most recent job title for each employee (unique id) and then calculate how long (= how many days) the employee has been on the same job title.
What I have done so far:
import csv
import datetime
from datetime import *
data = open("C:\\Users\\User\\PycharmProjects\\pythonProject\\jts.csv",encoding="utf-8")
csv_data = csv.reader(data)
data_lines = list(csv_data)
print(data_lines)
for i in data_lines:
    for j in i[0]:
But then I haven't got anywhere because I can't even conceptualise how to structure this. :-(
I also know that at one point I will need:
datetime.strptime(data_lines[1][2] , '%Y/%M/%d').date()
Could somebody help, please? I just need a new list saying something like:
id jt days
500 plumber 370
Edit to clarify: The dates are data points taken. I need to calculate back from the most recent of those back until the job title was something else. So in my example for employee 5000 from 04/07/2021 to 01/03/2020.
Let's consider sample data as follows:
id,jtitle,date
5000,plumber,01/01/2020
5000,senior plumber,02/03/2020
6000,software engineer,01/02/2020
6000,software architecture,06/02/2021
7000,software tester,06/02/2019
The following code works.
import pandas as pd
import datetime
# load data
data = pd.read_csv('data.csv')
# convert to datetime object
data.date = pd.to_datetime(data.date, dayfirst=True)
print(data)
# group employees by ID
latest = data.sort_values('date', ascending=False).groupby('id').nth(0)
print(latest)
# find the latest point in time where there is a change in job title
prev_date = data.sort_values('date', ascending=False).groupby('id').nth(1).date
print(prev_date)
# calculate the difference in days
latest['days'] = latest.date - prev_date
print(latest)
Output:
jtitle date days
id
5000 senior plumber 2020-03-02 61 days
6000 software architecture 2021-02-06 371 days
7000 software tester 2019-02-06 NaT
But then I haven't got anywhere because I can't even conceptualise how to structure this. :-(
Have a map (dict) of employee to (date, title).
For every row, check if you already have an entry for the employee. If you don't just put the information in the map, otherwise compare the date of the row and that of the entry. If the row has a more recent date, replace the entry.
Once you've gone through all the rows, you can just go through the map you've collected and compute the difference between the date you ended up with and "today".
Incidentally, your pattern is not correct: the sample data uses either a %d/%m/%Y (day/month/year) or %m/%d/%Y (month/day/year) format. The sample is not sufficient to say which, but it is certainly not year-first.
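A minimal sketch of that approach (assuming the column names from the sample data and, for concreteness, a %d/%m/%Y date format):
import csv
from datetime import datetime, date

latest = {}  # employee id -> (date, title)
with open('data.csv', newline='') as f:
    for row in csv.DictReader(f):
        d = datetime.strptime(row['date'], '%d/%m/%Y').date()
        # Keep only the most recent entry per employee
        if row['id'] not in latest or d > latest[row['id']][0]:
            latest[row['id']] = (d, row['jtitle'])

# Difference between the date we ended up with and "today"
for emp, (d, title) in latest.items():
    print(emp, title, (date.today() - d).days)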
Seems like I'm too late... Nevertheless, in case you're interested, here's a suggestion in pure Python (nothing wrong with Pandas, though!):
import csv
import datetime as dt
from operator import itemgetter
from itertools import groupby

# Read, transform (date), and sort in reverse (id first, then date):
with open('data.csv', newline='') as f:
    reader = csv.reader(f)
    next(reader)  # Discard header row
    data = sorted(((i, jtitle, dt.datetime.strptime(date, '%d/%m/%Y'))
                   for i, jtitle, date in reader),
                  key=itemgetter(0, 2), reverse=True)

# Process data grouped by id
result = []
for i, group in groupby(data, key=itemgetter(0)):
    _, jtitle, end = next(group)  # Fetch last job title resp. date
    # Search for first occurrence of a different job title:
    start = end
    for _, jt, start in group:
        if jt != jtitle:
            break
    # Collect results in a list with datetimes transformed back
    result.append((i, jtitle, end.strftime('%d/%m/%Y'), (end - start).days))
result = sorted(result, key=itemgetter(0))
The result for the input data
id,jtitle,date
5000,plumber,01/01/2020
5000,plumber,01/02/2020
5000,senior plumber,01/03/2020
5000,head plumber,01/05/2020
5000,head plumber,02/09/2020
5000,head plumber,05/01/2021
5000,head plumber,04/07/2021
6000,electrician,01/02/2018
6000,qualified electrician,01/06/2020
7000,plumber,01/01/2004
7000,plumber,09/11/2020
7000,senior plumber,05/06/2021
is
[('5000', 'head plumber', '04/07/2021', 490),
('6000', 'qualified electrician', '01/06/2020', 851),
('7000', 'senior plumber', '05/06/2021', 208)]

How to divide a pandas data frame into sublists of n at a time?

I have a data frame made of tweets and their author, there is a total of 45 authors. I want to divide the data frame into groups of 2 authors at a time such that I can export them later into csv files.
I tried using the following (given that the authors are in a column named 'B' and the tweets are in a column named 'A').
I took the following from this question
df.set_index(keys=['B'],drop=False,inplace=True)
authors = df['B'].unique().tolist()
In order to separate the lists:
dgroups = []
for i in range(0, len(authors)-1, 2):
    dgroups.append(df.loc[df.B==authors[i]])
    dgroups.extend(df.loc[df.B==authors[i+1]])
but instead it gives me sub-lists like this:
dgroups = [['A'],['B'],
[tweet,author],
['A'],['B'],
[tweet,author2]]
Prior to this I was able to divide them correctly into 45 sub-lists, derived from the previous link, as follows:
for i in authors:
    groups.append(df.loc[df.B==i])
So how would I do that for 2 authors, or 3 authors, or any other group size?
EDIT: from Jonathan Leon's answer, I thought I would do the following, which worked but isn't a dynamic solution and I guess is inefficient, especially if n>3:
dgroups = []
for i in range(2, len(authors)+1, 2):
    tempset1 = []
    tempset2 = []
    tempset1 = df.loc[df.B==authors[i-2]]
    if i-1 != len(authors):
        tempset2 = df.loc[df.B==authors[i-1]]
        dgroups.append(tempset1.append(tempset2))
    else:
        dgroups.append(tempset1)
This imports the foreign-language text incorrectly, but the logic works to create a new csv for every two authors.
import pandas as pd

df = pd.read_csv('TrainDataAuthorAttribution.csv')
# df.groupby('B').count()
authors = df.B.unique().tolist()
auths_in_subset = 2
for i in range(auths_in_subset, len(authors)+auths_in_subset, auths_in_subset):
    # print(authors[i-auths_in_subset:i])
    dft = df[df.B.isin(authors[i-auths_in_subset:i])]
    # print(dft)
    dft.to_csv('df' + str(i) + '.csv')
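To make the group size fully dynamic, the same idea can be wrapped in a small helper (a sketch; split_by_authors is a hypothetical name, and column 'B' comes from the question):
import pandas as pd

def split_by_authors(df, n):
    # Yield one sub-frame per group of n authors
    authors = df['B'].unique().tolist()
    for i in range(0, len(authors), n):
        yield df[df['B'].isin(authors[i:i+n])]

# e.g. one csv per group of 2 authors
for k, chunk in enumerate(split_by_authors(df, 2)):
    chunk.to_csv('authors_group_' + str(k) + '.csv', index=False)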

List to dataframe, list to multiple lists, single column to dataframe

Still figuring out programming, so help is appreciated! I have a single column of information that I would ultimately like to turn into a dataframe. I could transpose it, but the address information varies; it is either 2 or 3 lines (some have suite numbers, etc.).
It generally looks like this.
name x,
ID 1,
123-xyz,
ID 2,
abcdefg,
ACTIVITY,
ggg,
TYPE,
C,
COUNTY,
orange county,
ADDRESS,
123 stack st,
city state zip,
PHONE,
111-111-1111,
EXPIRES,
date,
name y,
ID 1,
456-abc,
ID 2,
cvbnmnb,
ACTIVITY,
ggg,
TYPE,
A,
COUNTY,
dakota county,
ADDRESS,
234 overflow st,
lot a,
city state zip,
PHONE,
000-000-0000,
EXPIRES,
date,
name z,
...,
I was thinking of creating new lists for all desired columns and conditionally appending values with a for loop.
for i in list
    if value = ID
        append previous value to name list
        append next value to ID list
    elif value = phone
        send next value to phone
    elif value = address
        evaluate 3 rows down
        if value = phone
            concatenate previous two values and append to address list
        if value != phone
            concatenate current and previous 2 values and append to address list
    else print error message
Would this be a decently efficient option for lists of around ~20,000 values?
I don't really know how to write this; I am using Python in a Jupyter notebook. Looking for solutions, but also looking to learn more!
-EDIT-
A user had suggested a while loop, and the original data sample I gave was simplified and contained 4 fields. My actual set contained 9, and I tried playing around but unfortunately wasn't able to figure it out on my own.
count = 0  # Pointer to start of a cluster
lengthdf = len(df)  # Getting the length of the existing dataframe to use it as the terminating condition
while count != lengthdf:
    name = id1 = id2 = activity = type = county = address = phone = expires = ""  # Reset the fields for every cluster of information
    name = df[0][count]  # Name is always the first line of a cluster
    id1 = df[0][count+2]  # id is always the third line of a cluster
    id2 = df[0][count+4]
    activity = df[0][count+6]
    type = df[0][count+8]
    county = df[0][count+10]
    n = 11
    while df[0][count+n] != "Phone":  # While the row is not 'PHONE', everything in between is the address, appended and separated by commas
        address = address + df[0][count+n] + ", "
        n += 1
    phone = df[0][count+n+1]  # Phone number is always the row after 'PHONE', and is only 1 line
    expires = df[0][count+n+3]
    n += 2
    newdf = newdf.append({'NAME': name, 'ID 1': id1, 'ID 2': id2, 'ACTIVITY': activity, 'TYPE': type, 'COUNTY': county, 'ADDRESS': address, 'Phone': phone, 'Expires': expires}, ignore_index=True)  # Append the data into the new dataframe
    count = count + n
You seem to have a good understanding of what you need to do, judging by the pseudocode you provided!
I'm assuming that your xlsx file looks something like this, without the commas.
Based on your sample data, this is what I came up with for you. I'll be referring to each user's data as a 'cluster'.
This code works under a few assumptions:
The PHONE field always only has one line of data.
There is complete data for every cluster (or if data is missing, a blank exists on the next row).
Data always comes in this particular order (i.e. name, ID, address, phone).
count will be like a pointer to the start of a cluster, while n will be the offset from count. Read the comments for the explanations.
import pandas as pd

df = pd.read_excel(r'test.xlsx', header=None)  # Import xlsx file
newdf = pd.DataFrame(columns=['name', 'id', 'address', 'phone'])  # Creating blank dataframe
count = 0  # Pointer to start of a cluster
lengthdf = len(df)  # Getting the length of the existing dataframe to use it as the terminating condition
while count != lengthdf:
    this_add = this_name = this_id = this_phone = ""  # Reset the fields for every cluster of information
    this_name = df[0][count]  # Name is always the first line of a cluster
    this_id = df[0][count+2]  # id is always the third line of a cluster
    n = 4
    while df[0][count+n] != "PHONE":  # While the row is not 'PHONE', everything in between is the address, appended and separated by commas
        this_add = this_add + df[0][count+n] + ", "
        n += 1
    this_phone = df[0][count+n+1]  # Phone number is always the row after 'PHONE', and is only 1 line
    n += 2
    newdf = newdf.append({'name': this_name, 'id': this_id, 'address': this_add, 'phone': this_phone}, ignore_index=True)  # Append the data into the new dataframe
    count = count + n
Performance-wise, I honestly do not think there is much optimisation to be done given the nature of the dataset (I might be wrong). You may have noticed my solution is pretty "hard-coded" to reduce the need for if-else statements, but 20,000 lines should not be a huge problem for a Jupyter Notebook. It may take a couple of minutes, but that should be alright.
I hope this gets you started on tackling other scenarios you may encounter with the remaining datasets!

Is pandas and numpy any good for manipulation of non numeric data?

I've been going in circles for days now, and I've run out of steam. Doesn't help that I'm new to python / numpy / pandas etc.
I started with numpy which led me to pandas, because of a GIS function that delivers a numpy array of data. That is my starting point. I'm trying to get to an endpoint being a small enriched dataset, in an excel spreadsheet.
But it seems like going down a rabbit hole trying to extract that data and then manipulate it with the numpy toolsets. The delivered data is one-dimensional, but each row contains 8 fields. A simple conversion to pandas and then to ndarray magically makes it all good, except that I lose the headers in the process, and it just snowballs from there.
I've had to re-evaluate my understanding based on some feedback on another post, and that's fine. But I'm just going in circles. Example after example seems to use predominantly numerical data, and I'm starting to get the feeling that's where its strength lies. Trying to use it for what I'd call a more non-mathematical / non-numerical purpose makes me feel like I'm barking up the wrong tree.
Any advice?
Addendum
The data I extract from the GIS system is names, dates, other textual data. I then have another csv file that I need to use as a lookup, so that I can enrich the source with more textual information which finally gets published to excel.
SAMPLE DATA - SOURCE
WorkCode Status WorkName StartDate EndDate siteType Supplier
0 AT-W34319 None Second building 2020-05-04 2020-05-31 Type A Acem 1
1 AT-W67713 None Left of the red office tower 2019-02-11 2020-08-28 Type B Quester Q
2 AT-W68713 None 12 main street 2019-05-23 2020-11-03 Class 1 Type B Dettlim Group
3 AT-W70105 None city central 2019-03-07 2021-08-06 Other Hans Int
4 AT-W73855 None top floor 2019-05-06 2020-10-28 Type a None
SAMPLE DATA - CSV
["Id", "Version","Utility/Principal","Principal Contractor Contact"]
XM-N33463,7.1,"A Contracting company", "555-12345"
XM-N33211,2.1,"Contractor #b", "555-12345"
XM-N33225,1.3,"That other contractor", "555-12345"
XM-N58755,1.0,"v Contracting", "555-12345"
XM-N58755,2.3,"dsContracting", "555-12345"
XM-222222,2.3,"dsContracting", "555-12345"
BM-O33343,2.1,"dsContracting", "555-12345"
def SMAN():
    ####################################################################################################################
    # Exporting the results of the analysis...
    ####################################################################################################################
    """
    Approach is as follows:
    1) Get the source data
    2) Get the CSV lookup data loaded into memory - it'll be faster
    3) Iterate through the source data, looking for matches in the CSV data
    4) Add an extra couple of columns onto the source data, and populate them with the (matching) lookup data
    5) Export the now enhanced data to Excel
    """
    arcpy.env.workspace = workspace + filenameGDB
    input = "ApprovedActivityByLocalBoard"
    exportFile = arcpy.da.FeatureClassToNumPyArray(input, ['WorkCode', 'Status', 'WorkName', 'PSN2', 'StartDate', 'EndDate', 'siteType', 'Supplier'])
    # we have our data, but it's (9893,) instead of [9893 rows x 8 columns]
    pdExportFile = pandas.DataFrame(exportFile)
    LBW = pdExportFile.to_numpy()
    del exportFile
    del pdExportFile
    # Now we have [9893 rows x 8 columns] - but we've lost the headers
    col_list = ["WorkCode", "Version", "Principal", "Contact"]
    allPermits = pandas.read_csv("lookup.csv", usecols=col_list)
    # Now we have the CSV file loaded, and only the important parts - should be fast.
    # Shape: (94523, 4)
    # will have to find a way to improve this...
    # The CSV file has more than one row per WorkCode, because there are different versions (as different records);
    # only want the last one.
    # Each record must now be "enhanced" with the matching record from the CSV file.
    finalReport = []  # we are expecting this to be [9893 rows x 12 columns] at the end
    counter = -1
    for eachWorksite in LBW[:5]:  # let's just work with 5 records right now...
        counter += 1
        # eachWorksite = list(eachWorksite)  # eachWorksite is a tuple - so need to convert it,
        # # but if we change it to a list, we lose the headers!
        certID = LBW[counter][0]  # get the ID to use for lookup matching
        # Search the CSV data
        permitsFound = allPermits[allPermits['Id'] == certID]
        permitsFound = permitsFound.to_numpy()
        if numpy.shape(permitsFound)[0] > 1:
            print("Too many hits!")  # got to deal with that CSV Version field.
            exit()
        else:
            # now "enrich" the record/row by adding on the fields from the lookup
            # so a row goes from 8 fields to 12 fields
            newline = numpy.append(eachWorksite, permitsFound)
            # and this enhanced record/row must become the new normal,
            # but I cannot change the original, so it must go into a new container
            finalReport = numpy.append(finalReport, newline, axis=0)
    # now I should have a new container of "enriched" data,
    # which has gone from [9893 rows x 8 columns] to [9893 rows x 12 columns]
    # Some of the columns, of course, could be empty.
    # Now let's dump the results to an Excel file and make it accessible for everyone else.
    df = pandas.DataFrame(finalReport)
    filepath = 'finalreport.csv'
    df.to_csv('filepath', index=False)
    # Somewhere I was getting Error("Cannot convert {0!r} to Excel".format(value))
    # Now I get
    filepath = 'finalReport.xlsx'
    df.to_excel(filepath, index=False)
I have eventually answered my own question, and this is how:
Yes, for my situation, pandas worked just fine, even beautifully, for manipulating non-numerical data. I just had to learn some basics.
The biggest learning was to understand the pandas dataframe as an object that has to be manipulated remotely by various functions/tools. Just because I "print" the dataframe doesn't mean it's just text. (Thanks juanpa.arrivillaga for pointing out my erroneous assumptions in Why can I not reproduce a nd array manually?)
I also had to wrap my mind around the concept of indexes and columns, how they can be altered/manipulated, and how to use them to maximum effect.
Once those fundamentals had been sorted, the rest followed naturally, and my code reduced to a couple of nice elegant functions.
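To give a flavour of where it ended up: the whole enrich-and-export step collapses to something along these lines (a sketch rather than my exact code; the column names come from the samples above, and the "only want the last Version" rule is handled with sort_values/drop_duplicates):
import pandas

src = pandas.DataFrame(exportFile)  # staying in pandas keeps the headers
lookup = pandas.read_csv('lookup.csv')
# Keep only the latest Version of each Id
latest = lookup.sort_values('Version').drop_duplicates('Id', keep='last')
# Enrich each worksite with its matching lookup record
report = src.merge(latest, left_on='WorkCode', right_on='Id', how='left')
report.to_excel('finalReport.xlsx', index=False)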
Cheers

Pythonic way to store and compare csv or xlsx attendance data

Background
I have the following in a .xlsx file when I run a report at work.
A1 - First Name
B1 - Last Name
C1 - Date Attended
Each row contains the data for each person that attended one of our events. I am building a program in python that will take a master.xlsx file and compare it with another .xlsx and give me the following output.
A .txt file with anyone who hasn't attended an event in the past 2 weeks.
A .txt file with anyone who hasn't attended an event in the past 4 weeks.
A .txt file with anyone who hasn't attended an event in 4+ weeks.
A new master.xlsx file with First_Name, Last_Name, Last_Date_Attended
The second .xlsx report is run weekly but actually has a month's worth of attendance data in it. That means if Joe Blow attended 6 times in a month Joe Blow will return 6 rows in the .xlsx file, each with a unique date. So I am going to iterate over the data, compare the dates and only keep the most recent one.
The Question
I have actually already done the above, and my first inclination was to turn it into a dictionary in a dictionary, where the last name is the key and the values are 'first', 'date', and 'total attended'. Total attended is calculated as part of the for loop.
But a dictionary in a dictionary just doesn't feel pythonic. I feel like I'm hacking around a simpler solution, especially once I begin writing the output files. Accessing values of a dict in a dict doesn't feel right.
Thoughts or suggestions on a better way?
Here's a sample of the code I wrote last night:
data = [This is the data from the .xlsx as a list of lists]
final_data = dict()
dict_errors = 0
for i in data:
    if len(i) < 3:  # count rows that don't have all three fields
        dict_errors += 1
    elif i[1] in final_data:
        final_data[i[1]]['total'] = final_data[i[1]]['total'] + 1
        if final_data[i[1]]['date'] < i[2]:
            final_data[i[1]]['date'] = i[2]
    else:
        final_data[i[1]] = {
            'first': i[0],
            'date': i[2],
            'total': 1
        }
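For what it's worth, here's the same bookkeeping sketched with a dataclass instead of the inner dict, which is the kind of alternative I'm wondering about (dates are compared as-is, just like above, so this assumes they compare correctly as stored):
from dataclasses import dataclass

@dataclass
class Attendee:
    first: str
    date: str
    total: int = 1

final_data = {}
for first, last, date in data:  # rows are [first, last, date]
    if last in final_data:
        rec = final_data[last]
        rec.total += 1
        if rec.date < date:  # keep the most recent date
            rec.date = date
    else:
        final_data[last] = Attendee(first, date)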
