I have a data frame like this:
Series_id F start end data
3 A 2012 2018 [[2012,0],[2014,0],[2015,1],[2017,3],[2019,0],[2020,1]]
I need output like this:
{series_id:3,start:2013,end:2013,count:1},{series_id:3,start:2016,end:2016,count:1},{series_id:3,start:2018,end:2018,count:1}
I have to iterate over each row of the data frame and find the gaps between years in the data column. The data column is a list of lists, each of which holds a year and some data.
I have to check for consecutive and missing years and count them.
Count example: take series_id 3 with start=2012 and end=2018. On iterating, 2013, 2016, and 2018 are missing, but in between, 2017 is present, so I have to get:
{series_id:3,start:2013,end:2013,count:1},
{series_id:3,start:2016,end:2016,count:1},{series_id:3,start:2018,end:2018,count:1}
How can I achieve this? I'm not even able to iterate over it.
Any help would be appreciated.
If your question is how to convert the result of a query performed on a pandas dataframe into a list: pandas has a method called to_json().
See the pandas documentation for examples and implementation details.
Use orient='records'.
Then you can use json.loads() to convert the result to a list.
Now you can iterate over it.
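As a minimal sketch of the round trip described above, using the sample row from the question (the gap-finding loop at the end is my assumption about the desired output, not part of the to_json() step itself):

```python
import json
import pandas as pd

# sample row from the question
df = pd.DataFrame({
    "series_id": [3],
    "start": [2012],
    "end": [2018],
    "data": [[[2012, 0], [2014, 0], [2015, 1], [2017, 3], [2019, 0], [2020, 1]]],
})

# to_json(orient="records") + json.loads gives a plain list of dicts
records = json.loads(df.to_json(orient="records"))

gaps = []
for rec in records:
    present = {year for year, _ in rec["data"]}  # years that appear in the data column
    for year in range(rec["start"], rec["end"] + 1):
        if year not in present:  # each missing year becomes its own record
            gaps.append({"series_id": rec["series_id"],
                         "start": year, "end": year, "count": 1})
# gaps -> one record each for 2013, 2016 and 2018
```

Once the rows are plain dicts, the iteration is ordinary Python, which sidesteps the original difficulty of looping over the dataframe directly.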
I am currently working on a project that uses a data frame of almost 24,000 basketball games from the years 2004-2021. What I want to do in the end is make a single data frame that has only one row for each year, where the column values are the mean for that category. What I have so far is a mask function that can separate by year, but I want to make a for loop that will go through the list of years, get the mean for each, and then concatenate them into a new data frame. The code might help explain this better.
# Now I want to separate this into data sets based on year, so I'll make a
# function to separate by year. In my original dataset, "SEASON" is the year.
def mask(year):
    mask = stats['SEASON'] == year
    year_mask = stats[mask]
    return year_mask
How can I possibly make this into a loop that separates by year, finds the mean values of all columns in that year, and combines them into one data frame that should have 18 rows spanning 2004-2021?
If you are using pandas dataframes, it's best to let pandas do the work for you.
I assume you want to calculate the mean of some category in your dataframe, grouped by year. To do this we can create a function like so:
def foo(df, category):
    return df.groupby(by=["year"])[category].mean()
If you want the mean of all the categories, just use:
df.groupby(by=["year"]).mean()
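A small self-contained sketch of the groupby approach; the stat columns below are made up for illustration, with "SEASON" playing the role of the year column from the question:

```python
import pandas as pd

# hypothetical per-game stats, two games per season
stats = pd.DataFrame({
    "SEASON": [2004, 2004, 2005, 2005],
    "PTS":    [100, 110, 90, 96],
    "AST":    [20, 24, 18, 22],
})

# one row per season, every numeric column averaged over that season's games
yearly = stats.groupby(by=["SEASON"]).mean()
```

This replaces the mask-plus-loop approach entirely: groupby does the per-year separation and mean in one step.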
I'm trying to drop rows from a dataframe if they 'partially' meet a certain condition.
By 'partially' I mean some (not all) values in the cell meet the condition.
Let's say that I have this dataframe.
>>> df
Title Body
0 Monday report: Stock market You should consider buying this.
1 Tuesday report: Equity XX happened.
2 Corrections and clarifications I'm sorry.
3 Today's top news Yes, it skyrocketed as I predicted.
I want to remove the entire row if the Title has "Monday report:" or "Tuesday report:".
One thing to note is that I used
TITLE = []
.... several lines of code to crawl the titles ....
TITLE.append(headline)
to crawl and store them into the dataframe.
Another thing is that my data are in tuples because I used
df = pd.DataFrame(list(zip(TITLE, BODY)), columns =['Title', 'Body'])
to make the dataframe.
I think that's why when I used,
df.query("'Title'.str.contains('Monday report:')")
I got an error.
When I did some googling here on StackOverflow, some advised converting the tuples into a multi-index and using filter(), drop(), or isin().
None of them worked.
Or maybe I used them in the wrong way?
Any ideas on how to solve this problem?
You can do a basic filter for a condition and then take the inverse of it using ~.
For example:
df[~df['Title'].str.contains('Monday report')] will give you output that excludes all rows whose Title contains 'Monday report'.
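Since str.contains interprets its pattern as a regular expression, both prefixes from the question can be handled in one pattern using | (alternation). A sketch with the sample rows from the question:

```python
import pandas as pd

df = pd.DataFrame({
    "Title": ["Monday report: Stock market", "Tuesday report: Equity",
              "Corrections and clarifications", "Today's top news"],
    "Body": ["You should consider buying this.", "XX happened.",
             "I'm sorry.", "Yes, it skyrocketed as I predicted."],
})

# ~ negates the boolean mask, dropping any row whose Title matches either prefix
filtered = df[~df["Title"].str.contains("Monday report:|Tuesday report:")]
```

Note this works on an ordinary string column; no tuple-to-multi-index conversion is needed, because pd.DataFrame(list(zip(...))) already unpacks the tuples into separate columns.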
I have a csv of support tickets for the whole year.
The first column is the ticket number, then date created, employee name, subject, status, etc.
Dates are written in this format: 3/25/2021 13:55
I know how to use basic lists of lists and dictionaries in Python, and basic if statements, but I don't know how to make a list that contains only the tickets submitted on the 24th of March. I might want to filter by month with a bigger data selection and make a list of all tickets submitted in February, but I don't know how to filter them.
I don't want to use pandas, as that's too confusing for me; I'm a beginner.
Can I use datetime to do it, or some other way?
from csv import reader
import datetime

opened_file = open(r'Y:\Eric_IT\all tickets this year Tickets - 20210325.csv')
read_file = reader(opened_file)
all_data = list(read_file)
print('These are headers:')
headers = all_data[0]
print(headers)
ticket_data = all_data[1:]
print('Number of tickets this year:')
print(len(ticket_data))
jan = 0
feb = 0
mar = 0
for each in ticket_data:
    if each[1] == ???  # [1] is the date field in my excel, and I want to take all February entries and put them in my feb list.
You already have all your data as a list of lists, separate from the headers, which is the first step. The second step is sorting / filtering the data based on some criterion.
Say, for example, you want to sort based on date. Naturally we'll be using Python's built-in list.sort function (as long as we're staying away from libraries like pandas). When you have complex data, it's usually necessary to pass a "key" function to sort so it knows how you want your data ordered. In this case we want to sort on the "date" column, so the key should take a whole row, pick out its second element, and convert that string into a date object, which can then be compared. (Think of "key" functions as taking in arbitrary data, like an entire row of the csv, and returning a value to sort by. The value doesn't strictly have to be numeric, but it's often easier if it is, and dates compare naturally.)
def sort_date(row):
    date_string = row[1]  # date was the second column
    # you'll need to change the format string to match what excel puts in the csv
    dt = datetime.datetime.strptime(date_string, "%m/%d/%Y %H:%M")
    return dt

ticket_data.sort(key=sort_date)  # sort ticket_data in-place by date
Sorting on "ticket number" would be even easier, as it already is in a numeric format (though we will probably have to convert it from a string.)
def sort_ticket_no(row):
    return int(row[0])

ticket_data.sort(key=sort_ticket_no)  # sort ticket_data in-place by ticket number
Selecting data based on conditions like from the "source" column can be done in a number of ways, but I will show an example of how I might do it.
First I would get all possible "sources" by making a dictionary which will have entries for each "source" type which is a list of rows. Then I would iterate over all your rows, and append each row to the appropriate category:
grouped_by_source = {}
for row in ticket_data:
    if row[8] in grouped_by_source:  # column 8: check if there's an entry yet we can append to
        grouped_by_source[row[8]].append(row)
    else:
        grouped_by_source[row[8]] = [row]  # create a new list containing our row if it's the first from a given "source"
You need to convert the dates to datetime objects, and then you can use attributes like the month or day to filter. I'd prefer to use a list comprehension myself. The error you mention in the comment means that it didn't convert to a datetime object.
import datetime as dt
jan = [item for item in ticket_data if dt.datetime.strptime(item[1], '%m/%d/%Y %H:%M').month == 1]
https://docs.python.org/3/library/datetime.html
https://www.kite.com/python/examples/2830/datetime-get-the-year,-month,-and-day-of-a-%60date%60
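Putting the pieces together with the date format from the question (3/25/2021 13:55, i.e. %m/%d/%Y %H:%M), here is a runnable sketch; the sample rows are made up for illustration:

```python
from datetime import datetime

# made-up rows in the shape described in the question:
# ticket number, date created, employee name, subject, status
ticket_data = [
    ["101", "2/14/2021 09:30", "Alice", "Printer jam", "Open"],
    ["102", "3/25/2021 13:55", "Bob", "VPN down", "Closed"],
    ["103", "2/2/2021 08:15", "Carol", "Email bounce", "Open"],
]

# keep only the rows whose creation date falls in February
feb = [row for row in ticket_data
       if datetime.strptime(row[1], "%m/%d/%Y %H:%M").month == 2]
```

The same comprehension filters by day just as easily: replace .month == 2 with, say, .month == 3 and .day == 24 to get tickets from the 24th of March.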
This idiot right here did it!! Look, guys! I followed a pandas-for-beginners tutorial and I pandas-ed the columns I wanted into dates! Now I can count and sort stuff. When I finish celebrating, I will go learn how to do that. Thank you, everyone.
I have a dataframe that has Date as its index. The dataframe holds stock market data, so the dates are not continuous. If I want to move, let's say, 120 rows up in the dataframe, how do I do that? For example:
If I want to get the data starting from 120 trading days before the start of year 2018, how do I do that below:
df['2018-01-01':'2019-12-31']
Thanks
Try this:
df.iloc[df.index.get_loc('2018-01-01'):df.index.get_loc('2019-12-31') + 1]
Get the location of both dates in the index and slice with .iloc to get the desired range. (Since Date is your index, use df.index rather than df.columns.)
UPDATE:
Based on your requirement, make some small modifications to the above.
Yearly indexing
>>> df.iloc[(df.index.get_loc('2018')).start:(df.index.get_loc('2019')).stop]
On a DatetimeIndex, df.index.get_loc('2018') gives back a slice object covering 2018; its .start attribute is the position of the first element of 2018, and similarly .stop is one past the last element of 2019.
Monthly indexing
Now suppose you want the data for the first 6 months of 2018 (without knowing what the first day is). The same can be done using:
>>> df.iloc[(df.index.get_loc('2018-01')).start:(df.index.get_loc('2018-06')).stop]
As you can see, we have indexed the first 6 months of 2018 using the same logic.
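On a sorted DatetimeIndex, plain .loc with partial date strings does the same yearly and monthly selection more directly. A quick sketch with made-up business-day data (business days only, so the dates are not continuous, like trading days):

```python
import numpy as np
import pandas as pd

# hypothetical trading calendar: weekdays only
dates = pd.bdate_range("2017-01-02", "2019-12-31")
df = pd.DataFrame({"close": np.arange(len(dates), dtype=float)}, index=dates)

yr = df.loc["2018"]               # every row whose date falls in 2018
h1 = df.loc["2018-01":"2018-06"]  # first six months of 2018
```

Partial string indexing like this avoids computing positions by hand when all you need is a calendar range.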
Assuming you are using pandas and the dataframe is sorted by dates, a very simple way would be:
initial_date = '2018-01-01'
initial_date_index = df.loc[df['dates'] == initial_date].index[0]
offset = 120
start_index = initial_date_index - offset
new_df = df.loc[start_index:]
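If the dates are the index rather than a column, the same offset idea can be sketched with positional indexing; the data below is made up, and searchsorted finds the position of the first row at or after the target date (so it also works when 2018-01-01 itself is not a trading day):

```python
import numpy as np
import pandas as pd

# hypothetical trading-day index (weekdays only, so not continuous)
dates = pd.bdate_range("2017-01-02", "2019-12-31")
df = pd.DataFrame({"close": np.arange(len(dates), dtype=float)}, index=dates)

# position of the first trading day on or after 2018-01-01, then step back 120 rows
start_pos = df.index.searchsorted(pd.Timestamp("2018-01-01"))
new_df = df.iloc[start_pos - 120:]
```

Because the offset is positional (.iloc), "120 rows" means 120 trading days here, regardless of how many calendar days those rows span.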
This question already has answers here:
Pandas groupby: How to get a union of strings
(8 answers)
Closed 3 years ago.
I'm new to pandas and I was able to create a dataframe from a csv file. I was also able to sort it.
What I am struggling with now is the following; I give an image as an example from a pandas data frame.
The first column is the index,
the second column is a group number,
the third column is what happened.
Based on the second column, I want to pull out the third column's values from the same data frame, one sequence per unique group number.
I highlight a few examples: for the number 9, get back the sequence
[60,61,70,51]
For the number 6, get back the sequence
[65,55,56]
For the number 8, get back the single element 8.
How can groupby be used to do this extraction?
Thanks a lot
Regards
Alex
Starting from the answers on this question, we can extract the following code to achieve the desired result.
dataframe = pd.DataFrame({'index': [0, 1, 2, 3, 4], 'groupNumber': [9, 9, 9, 9, 9], 'value': [12, 13, 14, 15, 16]})
grouped = dataframe.groupby('groupNumber')['value'].apply(list)
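With the values highlighted in the question (reconstructed here as a made-up frame, since the original data is only shown as an image), the same one-liner returns each group's sequence:

```python
import pandas as pd

df = pd.DataFrame({
    "groupNumber": [9, 9, 9, 9, 6, 6, 6, 8],
    "value":       [60, 61, 70, 51, 65, 55, 56, 8],
})

# one list per unique group number, preserving row order within each group
grouped = df.groupby("groupNumber")["value"].apply(list)
# grouped[9] -> [60, 61, 70, 51]; grouped[6] -> [65, 55, 56]; grouped[8] -> [8]
```

The result is a Series indexed by group number, so each sequence can be looked up directly with grouped[n].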