Groupby in Pandas - python

My code works, which is good lol, but I need the output to be displayed differently.
UPDATED CODE SINCE RECEIVING ANSWER
import pandas as pd
# Import File
YMM = pd.read_excel('C:/Users/PCTR261010/Desktop/OMIX_YMM_2016.xlsx').groupby(['Make','Model']).agg({'StartYear':'min', 'EndYear':'max'})
print(YMM)
The output has the columns Make | Model | StartYear | EndYear, with the models listed down the Model column next to the Make column, but the makes are collapsed the way they would be in a pivot table (each make shown only once for its group of models).
I need American Motors next to every American Motors model, Buick next to every Buick model, and so on.
Here is the link to sample data:
http://jmp.sh/KLZKWVZ

Try this:
res = YMM.groupby(['Make','Model'], as_index=False).agg({'StartYear':'min', 'EndYear':'max'})
or
res = YMM.groupby(['Make','Model']).agg({'StartYear':'min', 'EndYear':'max'}).reset_index()

With your own code (using as_index=False so the group keys stay ordinary columns):
Min = YMM.groupby(['Make','Model'], as_index=False).StartYear.min()
Max = YMM.groupby(['Make','Model'], as_index=False).EndYear.max()
Min['EndYear'] = Max.EndYear
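For illustration, here is a minimal, self-contained sketch with invented sample data (the column names follow the question, the values are made up) showing how as_index=False / reset_index() turns the grouped result back into flat columns where Make repeats on every row:

import pandas as pd

# Invented sample data mirroring the question's columns
ymm = pd.DataFrame({
    'Make':      ['Buick', 'Buick', 'Buick', 'American Motors'],
    'Model':     ['Regal', 'Regal', 'Skylark', 'AMX'],
    'StartYear': [1990, 1992, 1985, 1968],
    'EndYear':   [1994, 1996, 1989, 1970],
})

# Default: Make/Model become a MultiIndex, so each Make is printed only once per group
grouped = ymm.groupby(['Make', 'Model']).agg({'StartYear': 'min', 'EndYear': 'max'})
print(grouped)

# Flattened: Make and Model stay ordinary columns, repeated on every row
flat = ymm.groupby(['Make', 'Model'], as_index=False).agg({'StartYear': 'min', 'EndYear': 'max'})
print(flat)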

Related

Python, Appending Dataframe in the right order and printing the dataframe as a whole, timeseries forecasting using LSTM

So I'm currently trying to do time-series forecasting using an LSTM, and I'm still at the early stage where I want to make sure my data is clean.
For the background:
I'm trying to build LSTM models for temperature, rainfall, and humidity (my English is not good) for 3 stations, so if I'm correct there will be 9 models, 3 for each station. For now I'm experimenting with 1 year's worth of data.
The problem:
I named my files after the index of the month: Jan as 1, Feb as 2, Mar as 3, and so on.
Using the os library I managed to loop through the folder and clean each file: drop columns, fill the missing values, etc.
But when I append them, the order of the months is not correct; it starts from month 11, then goes to 8, etc. What am I doing wrong?
Also, how do I print a full dataframe? Currently I succeed in printing the full dataframe using the method below.
Here is the code:
import os
import pandas as pd

Dir_data = '/content/DATA'
excel_clean = pd.DataFrame()
train_data = []
for i in os.listdir(Dir_data):
    excel_test = pd.read_excel(i)
    # drop columns
    excel_test.drop(columns=['ff_avg', 'ddd_x', 'ddd_car', 'ff_avg', 'ff_x', 'ss', 'Tn', 'Tx'], inplace=True)
    # start cleaning
    excel_test = excel_test.replace(8888, 'x').replace(9999, 'x').replace('', 'x')
    excel_test['RR'] = pd.to_numeric(excel_test['RR'], errors='coerce').astype('float64')
    excel_test['RH_avg'] = pd.to_numeric(excel_test['RH_avg'], errors='coerce').astype('int64')
    excel_test['Tavg'] = pd.to_numeric(excel_test['Tavg'], errors='coerce').astype('float64')
    # excel_test.dtypes
    # filling missing values
    excel_test['RR'] = excel_test['RR'].fillna(excel_test['RR'].mean())
    excel_test['RH_avg'] = excel_test['RH_avg'].fillna(excel_test['RH_avg'].mean())
    excel_test['Tavg'] = excel_test['Tavg'].fillna(excel_test['Tavg'].mean())
    excel_test['RR'] = excel_test['RR'].round(decimals=1)
    excel_test['Tavg'] = excel_test['Tavg'].round(decimals=1)
    excel_clean = excel_clean.append(excel_test)

pd.set_option('max_rows', 99999)
pd.set_option('max_colwidth', 400)
pd.describe_option('max_colwidth')
excel_clean.reset_index(drop=True, inplace=True)
excel_clean
It's only for 1 station, as this is an experiment.
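For reference, a minimal sketch of one way to keep the months in order, assuming the files are literally named 1.xlsx, 2.xlsx, ... 12.xlsx as described: os.listdir() returns names in arbitrary order, so sort them numerically before reading, and collect the cleaned frames with pd.concat (DataFrame.append is deprecated in recent pandas):

import os
import pandas as pd

Dir_data = '/content/DATA'  # path taken from the question

# Sort file names by their numeric month index; os.listdir gives no guaranteed order
files = sorted(os.listdir(Dir_data), key=lambda name: int(os.path.splitext(name)[0]))

frames = []
for name in files:
    frame = pd.read_excel(os.path.join(Dir_data, name))
    # ... the cleaning steps from the question would go here ...
    frames.append(frame)

excel_clean = pd.concat(frames, ignore_index=True)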

Python: iterate through the rows of a csv and calculate date difference if there is a change in a column

Only basic knowledge of Python, so I'm not even sure if this is possible?
I have a csv that looks like this (screenshot): https://i.stack.imgur.com/8clYM.png
(This is dummy data, the real one is about 30K rows.)
I need to find the most recent job title for each employee (unique id) and then calculate how long (= how many days) the employee has been on the same job title.
What I have done so far:
import csv
import datetime
from datetime import *
data = open("C:\\Users\\User\\PycharmProjects\\pythonProject\\jts.csv",encoding="utf-8")
csv_data = csv.reader(data)
data_lines = list(csv_data)
print(data_lines)
for i in data_lines:
    for j in i[0]:
But then I haven't got anywhere because I can't even conceptualise how to structure this. :-(
I also know that at one point I will need:
datetime.strptime(data_lines[1][2] , '%Y/%M/%d').date()
Could somebody help, please? I just need a new list saying something like:
id jt days
500 plumber 370
Edit to clarify: the dates are data points that were recorded. I need to calculate back from the most recent of them to the point where the job title was something else. So in my example, for employee 5000, from 04/07/2021 back to 01/03/2020.
Let's consider sample data as follows:
id,jtitle,date
5000,plumber,01/01/2020
5000,senior plumber,02/03/2020
6000,software engineer,01/02/2020
6000,software architecture,06/02/2021
7000,software tester,06/02/2019
The following code works.
import pandas as pd
import datetime
# load data
data = pd.read_csv('data.csv')
# convert to datetime object
data.date = pd.to_datetime(data.date, dayfirst=True)
print(data)
# group employees by ID
latest = data.sort_values('date', ascending=False).groupby('id').nth(0)
print(latest)
# find the latest point in time where there is a change in job title
prev_date = data.sort_values('date', ascending=False).groupby('id').nth(1).date
print(prev_date)
# calculate the difference in days
latest['days'] = latest.date - prev_date
print(latest)
Output:
jtitle date days
id
5000 senior plumber 2020-03-02 61 days
6000 software architecture 2021-02-06 371 days
7000 software tester 2019-02-06 NaT
But then I haven't got anywhere because I can't even conceptualise how to structure this. :-(
Have a map (dict) of employee to (date, title).
For every row, check whether you already have an entry for the employee. If you don't, just put the information in the map; otherwise compare the date of the row with that of the entry, and if the row has a more recent date, replace the entry.
Once you've gone through all the rows, go through the map you've collected and compute the difference between the date you ended up with and "today".
Incidentally, your format string is not correct: the sample data uses %d/%m/%Y (day/month/year) or %m/%d/%Y (month/day/year). The sample is not sufficient to say which, but it is certainly not YMD.
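For illustration, a minimal sketch of that map-based approach (the file name data.csv, the id,jtitle,date header and the %d/%m/%Y dates are assumed from the samples in this thread):

import csv
from datetime import datetime, date

latest = {}  # employee id -> (date, job title)

with open('data.csv', newline='') as f:
    for row in csv.DictReader(f):
        d = datetime.strptime(row['date'], '%d/%m/%Y').date()
        emp = row['id']
        # Keep only the most recent row seen so far for this employee
        if emp not in latest or d > latest[emp][0]:
            latest[emp] = (d, row['jtitle'])

# Days between the most recent record and "today"
for emp, (d, title) in latest.items():
    print(emp, title, (date.today() - d).days)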
Seems like I'm too late... Nevertheless, in case you're interested, here's a suggestion in pure Python (nothing wrong with Pandas, though!):
import csv
import datetime as dt
from operator import itemgetter
from itertools import groupby

# Read, transform (date), and sort in reverse (id first, then date):
with open('data.csv', newline='') as f:
    reader = csv.reader(f)
    next(reader)  # Discard header row
    data = sorted(((i, jtitle, dt.datetime.strptime(date, '%d/%m/%Y'))
                   for i, jtitle, date in reader),
                  key=itemgetter(0, 2), reverse=True)

# Process data grouped by id
result = []
for i, group in groupby(data, key=itemgetter(0)):
    _, jtitle, end = next(group)  # Fetch last job title resp. date
    # Search for the first occurrence of a different job title:
    start = end
    for _, jt, start in group:
        if jt != jtitle:
            break
    # Collect results in a list with the datetimes transformed back
    result.append((i, jtitle, end.strftime('%d/%m/%Y'), (end - start).days))
result = sorted(result, key=itemgetter(0))
The result for the input data
id,jtitle,date
5000,plumber,01/01/2020
5000,plumber,01/02/2020
5000,senior plumber,01/03/2020
5000,head plumber,01/05/2020
5000,head plumber,02/09/2020
5000,head plumber,05/01/2021
5000,head plumber,04/07/2021
6000,electrician,01/02/2018
6000,qualified electrician,01/06/2020
7000,plumber,01/01/2004
7000,plumber,09/11/2020
7000,senior plumber,05/06/2021
is
[('5000', 'head plumber', '04/07/2021', 490),
('6000', 'qualified electrician', '01/06/2020', 851),
('7000', 'senior plumber', '05/06/2021', 208)]

Is pandas and numpy any good for manipulation of non numeric data?

I've been going in circles for days now, and I've run out of steam. It doesn't help that I'm new to Python / numpy / pandas etc.
I started with numpy, which led me to pandas, because of a GIS function that delivers a numpy array of data. That is my starting point. I'm trying to get to an endpoint: a small, enriched dataset in an Excel spreadsheet.
But it seems like going down a rabbit hole trying to extract that data and then manipulate it with the numpy toolset. The delivered data is one-dimensional, but each row contains 8 fields. A simple conversion to pandas and then to ndarray magically makes it all good, except that I lose the headers in the process, and it just snowballs from there.
I've had to re-evaluate my understanding based on some feedback on another post, and that's fine, but I'm just going in circles. Example after example seems to use predominantly numerical data, and I'm starting to get the feeling that's where its strength lies. Trying to use it for what I'd call a more non-mathematical / non-numerical purpose looks like barking up the wrong tree.
Any advice?
Addendum
The data I extract from the GIS system is names, dates, other textual data. I then have another csv file that I need to use as a lookup, so that I can enrich the source with more textual information which finally gets published to excel.
SAMPLE DATA - SOURCE
WorkCode Status WorkName StartDate EndDate siteType Supplier
0 AT-W34319 None Second building 2020-05-04 2020-05-31 Type A Acem 1
1 AT-W67713 None Left of the red office tower 2019-02-11 2020-08-28 Type B Quester Q
2 AT-W68713 None 12 main street 2019-05-23 2020-11-03 Class 1 Type B Dettlim Group
3 AT-W70105 None city central 2019-03-07 2021-08-06 Other Hans Int
4 AT-W73855 None top floor 2019-05-06 2020-10-28 Type a None
SAMPLE DATA - CSV
["Id", "Version","Utility/Principal","Principal Contractor Contact"]
XM-N33463,7.1,"A Contracting company", "555-12345"
XM-N33211,2.1,"Contractor #b", "555-12345"
XM-N33225,1.3,"That other contractor", "555-12345"
XM-N58755,1.0,"v Contracting", "555-12345"
XM-N58755,2.3,"dsContracting", "555-12345"
XM-222222,2.3,"dsContracting", "555-12345"
BM-O33343,2.1,"dsContracting", "555-12345"
import arcpy
import numpy
import pandas

def SMAN():
    ####################################################################################################################
    # Exporting the results of the analysis...
    ####################################################################################################################
    """
    Approach is as follows:
    1) Get the source data
    2) Get the CSV lookup data loaded into memory - it'll be faster
    3) Iterate through the source data, looking for matches in the CSV data
    4) Add an extra couple of columns onto the source data, and populate them with the (matching) lookup data.
    5) Export the now enhanced data to Excel.
    """
    arcpy.env.workspace = workspace + filenameGDB
    input = "ApprovedActivityByLocalBoard"
    exportFile = arcpy.da.FeatureClassToNumPyArray(input, ['WorkCode', 'Status', 'WorkName', 'PSN2', 'StartDate', 'EndDate', 'siteType', 'Supplier'])
    # we have our data, but it's (9893,) instead of [9893 rows x 8 columns]
    pdExportFile = pandas.DataFrame(exportFile)
    LBW = pdExportFile.to_numpy()
    del exportFile
    del pdExportFile
    # Now we have [9893 rows x 8 columns] - but we've lost the headers
    col_list = ["WorkCode", "Version", "Principal", "Contact"]
    allPermits = pandas.read_csv("lookup.csv", usecols=col_list)
    # Now we have the CSV file loaded ... and only the important parts - should be fast.
    # Shape: (94523, 4)
    # will have to find a way to improve this...
    # The CSV file has more than one row per WorkCode, because there are different versions (as different records).
    # Only want the last one.
    # Each record must now be "enhanced" with the matching record from the CSV file.
    finalReport = []  # we are expecting this to be [9893 rows x 12 columns] at the end
    counter = -1
    for eachWorksite in LBW[:5]:  # let's just work with 5 records right now...
        counter += 1
        # eachWorksite = list(eachWorksite)  # eachWorksite is a tuple - so need to convert it
        # # but if we change it to a list, we lose the headers!
        certID = LBW[counter][0]  # get the ID to use for lookup matching
        # Search the CSV data
        permitsFound = allPermits[allPermits['Id'] == certID]
        permitsFound = permitsFound.to_numpy()
        if numpy.shape(permitsFound)[0] > 1:
            print("Too many hits!")  # got to deal with that CSV Version field.
            exit()
        else:
            # now "enrich" the record/row by adding on the fields from the lookup
            # so a row goes from 8 fields to 12 fields
            newline = numpy.append(eachWorksite, permitsFound)
            # and this enhanced record/row must become the new normal
            # but I cannot change the original, so it must go into a new container
            finalReport = numpy.append(finalReport, newline, axis=0)
    # now I should have a new container, of "enriched" data
    # which has gone from [9893 rows x 8 columns] to [9893 rows x 12 columns]
    # Some of the columns, of course, could be empty.
    # Now let's dump the results to an Excel file and make it accessible for everyone else.
    df = pandas.DataFrame(finalReport)
    filepath = 'finalreport.csv'
    df.to_csv(filepath, index=False)
    # Somewhere I was getting Error("Cannot convert {0!r} to Excel".format(value))
    # Now I get
    filepath = 'finalReport.xlsx'
    df.to_excel(filepath, index=False)
I have eventually answered my own question, and this is how:
Yes, for my situation, pandas worked just fine, even beautifully, for manipulating non-numerical data. I just had to learn some basics.
The biggest learning was to understand the pandas dataframe as an object that has to be manipulated remotely by various functions/tools. Just because I "print" the dataframe doesn't mean it's just text. (Thanks juanpa.arrivillaga for pointing out my erroneous assumptions in Why can I not reproduce a nd array manually?)
I also had to wrap my mind around the concept of indexes and columns, how they can be altered/manipulated etc., and then how to use them to maximum effect.
Once those fundamentals had been sorted, the rest followed naturally, and my code reduced to a couple of nice elegant functions.
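For what it's worth, the lookup-and-enrich step above collapses to something like the following sketch once it is done with a pandas merge instead of row-by-row numpy appends; the column names follow the samples above, the file names are placeholders, and the source rows are invented for illustration:

import pandas as pd

# Invented source rows mirroring the GIS export sample
source = pd.DataFrame({
    'WorkCode': ['AT-W34319', 'AT-W67713'],
    'WorkName': ['Second building', 'Left of the red office tower'],
    'Supplier': ['Acem 1', 'Quester Q'],
})

# Lookup table; column names assumed from the CSV sample (plain headers)
lookup = pd.read_csv('lookup.csv',
                     usecols=['Id', 'Version', 'Utility/Principal',
                              'Principal Contractor Contact'])

# Keep only the latest Version per Id (Version compared as text here) and rename the key
lookup = (lookup.sort_values('Version')
                .drop_duplicates('Id', keep='last')
                .rename(columns={'Id': 'WorkCode'}))

# Enrich the source with the matching lookup columns and export to Excel
enriched = source.merge(lookup, on='WorkCode', how='left')
enriched.to_excel('finalReport.xlsx', index=False)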
Cheers

Moving Average for python issue

I am looking at corona data from the NY Times which can be found here: https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv
And open for everyone to use.
The dataset is set up like this:
df = pd.read_csv('https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv')
df2 = df.copy()
df2 = df2.set_index('date')
df2['cases_lagged'] = df2.groupby(['county', 'state'])['cases'].shift()
df2[df2['fips']== 34041.0].head(10)
I was hoping I could create a moving average column in the same way, using a groupby statement along with pandas' .rolling() to compute 7-day and 14-day moving averages, but it does not work.
I tried it two separate ways:
#way 1
df2['moving_avg'] = df2.groupby(['county', 'state']).iloc[:4].rolling(window = 7).mean()
#way 2
df2['moving_avg'] = df2.groupby(['county', 'state'])['cases'].rolling(window = 7).mean()
And neither seems to work here.
Any thoughts on how to compile the moving average for each county within each state without having to break out each and every county into its own df for it to work? Thanks
When I ran it with all the data I terminated it because it was running for a long time, so I limited the run to specific counties.
I'm not sure whether I've achieved the intended result:
los = df2[(df2['county'] == 'Los Angeles') & (df2['state'] == 'California')]
los['moving_avg'] = los[['county', 'state', 'cases']].groupby(['county', 'state'], group_keys=False).rolling(window = 7).mean()
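A common pattern here, shown as a sketch against the df2 frame built in the question, is to let groupby().transform() compute the rolling mean per county/state group and hand back a column aligned with the original rows (7- and 14-day windows as asked):

# Per-group rolling means, aligned back to df2's rows via transform
for window in (7, 14):
    df2[f'moving_avg_{window}'] = (
        df2.groupby(['county', 'state'])['cases']
           .transform(lambda s: s.rolling(window).mean())
    )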

I Need Assistance With Data Sorting In Python Code

In my Python code, I would also like the Dakota-with-Hurricane display appearances to show in the data table when it is run in a Jupyter notebook.
I typed the following modification to the code, aiming to achieve this:
(df['Spitfire'].str.contains('S', na=True))
Now the Dakota-with-Hurricane display booking (in this case Worthing - Display) shows, as do the Dakota/Spitfire/Hurricane and Dakota-with-Spitfire display bookings. But the solo Dakota display bookings also show, and I don't want those. What do I type so that a row where Dakota = 'D', Spitfire = NaN and Hurricane = NaN is not displayed?
I have almost managed to sort out what I need in my Python code for the 2007 URL; I just need the Dakota-with-Hurricane bookings issue sorting out. Here is my code, containing the relevant URL:
import pandas as pd
import requests
from bs4 import BeautifulSoup
res = requests.get("http://web.archive.org/web/20070701133815/http://www.bbmf.co.uk/june07.html")
soup = BeautifulSoup(res.content,'lxml')
table = soup.find_all('table')[0]
df = pd.read_html(str(table))
df = df[1]
df = df.rename(columns=df.iloc[0])
df = df.iloc[2:]
df.head(15)
display = df[(df['Location'].str.contains('- Display')) & (df['Dakota'].str.contains('D')) & (df['Spitfire'].str.contains('S', na=True)) & (df['Lancaster'] != 'L')]
display
Any help would be much appreciated.
Regards
Eddie
You could query your display variable to refine the data:
display = display[~((display['Dakota'] == 'D') & display['Spitfire'].isnull() & display['Hurricane'].isnull())]
where the ~ negates the condition, so the query excludes those rows from the DataFrame.
You can also include this in your original query on df.
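For instance, a sketch of folding the exclusion into the original filter on df (same column names as in the question):

# Keep '- Display' rows with a Dakota, allow a missing Spitfire, exclude Lancaster bookings,
# and drop solo Dakota rows (Spitfire and Hurricane both NaN)
mask = (
    df['Location'].str.contains('- Display')
    & df['Dakota'].str.contains('D')
    & df['Spitfire'].str.contains('S', na=True)
    & (df['Lancaster'] != 'L')
    & ~((df['Dakota'] == 'D') & df['Spitfire'].isnull() & df['Hurricane'].isnull())
)
display = df[mask]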
