Sorting DF by a column with multiple values - python

In my main df, I have a column built by combining two other columns, creating values that look like this: A1_43567_1. The first part represents the type of assessment taken, the second number is a question ID, and the final number is the question's position on the assessment. I plan on creating a pivot table with each unique value as a column so I can look across multiple students' selections per item. But I want the pivot's columns ordered by Question Position, i.e. the third value in the concatenation. Essentially this output:
Student ID  A1_45678_1  A1_34551_2  A1_11134_3  etc....
12345       1           0           0
12346       0           0           1
12343       1           1           0
I've tried sorting my dataframe by the original column I want it sorted by (Question Position) and then creating the pivot table, but that doesn't produce the result I'm looking for. Is there a way to sort the concatenated values by the third value in the column? Or is it possible to sort a pivot table by the third value in each column name?
Current code is:
demo_pivot = demo_pivot.sort_values(['Question Position'], ascending=True)
demo_pivot['newcol'] = 'A' + str(interim_selection) + '_' + \
    demo_pivot['Item ID'].map(str) + "_" + demo_pivot['Question Position'].map(str)
demo_pivot = pd.pivot_table(demo_pivot, index='Student ANET ID', values='Points Received',
                            columns='newcol').reset_index()
But it generates this output:
Student ID  A1_45678_1  A1_34871_7  A1_11134_15  etc....
12345       1           0           0
12346       0           0           1
12343       1           1           0

The call to pd.pivot_table() returns a DataFrame, correct? If so, can you just reorder the columns of the resulting DataFrame? Something like:
def sort_columns(column_list):
    # Create a list of tuples (question position, column name) for columns
    # that follow the A<assessment>_<item>_<position> pattern
    sort_list = [(int(col.split('_')[2]), col) for col in column_list if col.count('_') == 2]
    # Columns that don't match (e.g. 'Student ANET ID' after reset_index) go first
    other_cols = [col for col in column_list if col.count('_') != 2]
    # Sorts by the first item in each tuple, which is the question position
    sort_list.sort()
    # Return the column names in the sorted order:
    return other_cols + [x[1] for x in sort_list]

# Now, you should be able to reorder the DataFrame like so:
demo_pivot = demo_pivot.loc[:, sort_columns(demo_pivot.columns)]
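Alternatively, if your pandas is new enough, you can sort the pivot's columns before resetting the index. A minimal sketch, assuming pandas >= 1.1 for the key argument of DataFrame.sort_index:
demo_pivot = pd.pivot_table(demo_pivot, index='Student ANET ID',
                            values='Points Received', columns='newcol')
# Sort the columns by the integer after the last underscore (the question position)
demo_pivot = demo_pivot.sort_index(
    axis=1, key=lambda cols: cols.map(lambda c: int(c.rsplit('_', 1)[-1]))
).reset_index()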

How do I choose a row based off the index in python/pandas?

I have a row in my csv file here:
   0  1   2  3                        4                        5
0  I  55  A  2018-03-10 00:00:00.000  username_in_current_row  2012-01-24 00:00:00.000
I want to write something in Python that selects only the rows that have "I" in column 0, then does a row count of how many records have "I" there. And I want to print something like: "X number of records were inserted successfully where an 'I' exists."
So in this case I would do something like:
df = pd.read_csv('file.csv', header=None)
print(df)
This would print the dataframe.
df_row_count = df.shape[0]
Here I want a row count of how many records have "I" in them at index 0.
print("X amount of records inserted successfully.")
I think I need an f-string here in place of X that would just count the total number of rows in the table with "I"?
It sounds like you're saying you want to filter the dataframe. If you want to find where there's an "I" at the index of zero and then print the length, you can do this:
filtered_df = df[(df.index == 0) & (df[0] == "I")]
print(f'{len(filtered_df)} records had an "I" at index 0 for column 0.')
The thing is, only the first row will have an index of 0 after your .read_csv() line, so it's 1 at most.
Did you actually want to count how many rows have "I" in the first column? If so, it's more like this:
filtered_df = df[df[0]=="I"]
print(f"{len(filtered_df)} records had an "I" in the first column.")
Hope that helps. :)
To filter the rows with "I" in column 0, use boolean indexing:
out = df[df[0].eq('I')]
To count them, sum the boolean Series (each True counts as 1):
count = df[0].eq('I').sum()
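Putting it together, a minimal sketch, assuming file.csv is laid out as in the question:
import pandas as pd

df = pd.read_csv('file.csv', header=None)

# Boolean mask: True where the first column equals "I"
mask = df[0].eq('I')
count = mask.sum()

print(f'{count} records were inserted successfully where an "I" exists.')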

To get a subset of a dataframe based on the index of a label

I have a dataframe from yahoo finance
import pandas as pd
import yfinance
ticker = yfinance.Ticker("INFY.NS")
df = ticker.history(period = '1y')
print(df)
This gives me the df, indexed by date.
If I specify
date = "2021-04-23"
I need a subset of df containing:
the row with the index label "2021-04-23"
the rows at the 2 index positions before that date
the row at the 1 index position after that date
The important thing here is that we cannot compute "before" and "after" using date strings, since df may be missing some dates; the rows should be selected by index position (i.e. the 2 previous rows and the 1 next row).
For example, in df there is no "2021-04-21", but there is "2021-04-20".
How can we implement this?
You can go for integer-based indexing. First find the integer location of the desired date and then take the desired subset with iloc:
import numpy as np

def get_subset(df, date):
    # get the integer index of the matching date(s)
    matching_dates_inds, = np.nonzero(df.index == date)
    # and take the first one (works in case of duplicates)
    first_matching_date_ind = matching_dates_inds[0]
    # take the 4-row subset: 2 rows before, the date itself, and 1 row after
    desired_subset = df.iloc[first_matching_date_ind - 2: first_matching_date_ind + 2]
    return desired_subset
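For example (a sketch, assuming df is the yfinance frame from the question and that the date string is present in its index):
subset = get_subset(df, "2021-04-23")
print(subset)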
If you need the before and after values by position (assuming the date always exists in the DatetimeIndex), use DataFrame.iloc with the position from Index.get_loc, clamping the slice with min and max, as in the sample data:
df = pd.DataFrame({'a': [1, 2, 3]},
                  index=pd.to_datetime(['2021-04-21', '2021-04-23', '2021-04-25']))
date = "2021-04-23"
pos = df.index.get_loc(date)
df = df.iloc[max(0, pos-2):min(len(df), pos+2)]
print (df)
            a
2021-04-21  1
2021-04-23  2
2021-04-25  3
Notice:
min and max are added so the selection does not fail if the date is the first row (there are not 2 values before it), the second row (only 1 value before it), or the last row (no value after it).

Select rows from Dataframe with variable number of conditions

I'm trying to write a function that takes as inputs a DataFrame with a column 'timestamp' and a list of tuples. Every tuple will contain a beginning and end time.
What I want to do is "split" the dataframe into two new ones, where the first contains the rows whose timestamp value is not contained between the extremes of any tuple, and the other is its complement.
The number of filter tuples is not known a priori though.
df = pd.DataFrame({'timestamp': [0, 1, 2, 5, 6, 7, 11, 22, 33, 100], 'x': [1, 2, 3, 4, 5, 6, 7, 8, 9, 1]})
filt = [(1,4), (10,40)]
left, removed = func(df, filt)
This should give me two dataframes
left: with rows with timestamp [0,5,6,7,100]
removed: with rows with timestamp [1,2,11,22,33]
I believe the right approach is to write a custom function that can be used as a filter, and then call it somehow to filter/mask the dataframe, but I could not find a proper example of how to implement this.
Check this, using Series.between combined with pd.concat:
out = df[~pd.concat([df.timestamp.between(*x) for x in filt]).any(level=0)]
Out[175]:
   timestamp  x
0          0  1
3          5  4
4          6  5
5          7  6
9        100  1
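To wrap that idea into the requested function, here is a sketch; func, left and removed follow the names in the question, and the masks are combined with a row-wise any instead of any(level=0):
import pandas as pd

def func(df, filt):
    # One boolean mask per (start, end) tuple, combined with a row-wise OR
    masks = [df['timestamp'].between(start, end) for start, end in filt]
    removed_mask = pd.concat(masks, axis=1).any(axis=1)
    # Rows matching no interval stay in left, the rest go to removed
    return df[~removed_mask], df[removed_mask]

left, removed = func(df, filt)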
Can't you use filtering with .isin():
left = df[df['timestamp'].isin([0, 5, 6, 7, 100])]
removed = df[df['timestamp'].isin([1, 2, 11, 22, 33])]

Adding new rows in pandas dataframe at specific index

I have read all the answers related to my question available on Stack Overflow, but my question is a little different from the available answers. I have a very large dataframe, and a portion of it follows.
The input dataframe looks like this:
A B C D
0 foot 17/1: OGChan_2020011717711829281829281 , 7days ...
1 arm this will processed after ;;;
2 leg go_2020011625692400374400374 16/1: Id Imerys_2020011618188744093744093
3 head xyziemen_2020011510691787006787006 en_2020011510749462801462801 ;;;
: : : :
In this dataframe, I am first extracting IDs from column B based on a regular expression. Some rows of column B may contain these IDs, some may not, and some rows of column B may be blank. The following is the code:
import re
import pandas as pd

df = pd.read_excel("Book1.xlsx", "Sheet1")
dict = {}
for i in df.index:
    j = str(df['B'][i])
    if re.findall(r'_\d{25}', j):
        a = re.findall(r'_\d{25}', j)
        print(a)
        dict[i] = a
The regular expression matches an underscore followed by 25 digits. Examples in the above df are _2020011618188744093744093, _2020011510749462801462801, etc.
Now I want to insert these IDs into column D of the corresponding row. For example, if two IDs are found in the 0th row, the first ID should go in the 0th row of column D and the second ID should go in the 1st row of column D, with the rest of the dataframe shifted down. What I want will be clear from the following output; I want my output to be as follows, based on the above input.
   A     B                                                                         ..  D
0  foot  17/1: OGChan_2020011717711829281829281 , 7days                                _2020011717711829281829281
1  arm   this will processed after
2  leg   go_2020011625692400374400374 16/1: Id Imerys_2020011618188744093744093        _2020011625692400374400374
3                                                                                      _2020011618188744093744093
4  head  xyziemen_2020011510691787006787006 en_2020011510749462801462801               _2020011510691787006787006
5                                                                                      _2020011510749462801462801
:  :     :                                                                             :
In the above output, 1 ID is found in the 0th row, so column D of the 0th row contains that ID. No ID is found at the 1st index, so column D of the 1st index is empty. At the 2nd index there are two IDs, hence the first ID is placed in the 2nd row of column D and the second ID is placed in the 3rd row of column D, shifting the previous content of the 3rd row down to the 4th row. I want the above output as my final output.
Hope I am clear. Thanks in advance
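One possible sketch, assuming pandas >= 0.25 for explode; the sample frame below is a hypothetical stand-in for the real data, and the repeated A/B values are blanked out afterwards rather than shifted:
import pandas as pd

# Hypothetical sample mirroring the structure described in the question
df = pd.DataFrame({
    'A': ['foot', 'arm', 'leg', 'head'],
    'B': ['17/1: OGChan_2020011717711829281829281 , 7days',
          'this will processed after',
          'go_2020011625692400374400374 16/1: Id Imerys_2020011618188744093744093',
          'xyziemen_2020011510691787006787006 en_2020011510749462801462801'],
})

# Find every "_<25 digits>" ID per row; rows with no match keep an empty string
df['D'] = df['B'].str.findall(r'_\d{25}').apply(lambda ids: ids if ids else [''])
# One row per ID: extra IDs from the same source row become extra rows below it
out = df.explode('D').reset_index(drop=True)
# Blank out the repeated A/B values so only the first occurrence keeps them
# (assumes the original A/B pairs are unique)
out.loc[out.duplicated(subset=['A', 'B']), ['A', 'B']] = ''
print(out)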

Pivot across multiple columns with repeating values in each column

I am trying to pivot a pandas dataframe, but the data follows a strange format that I cannot seem to pivot. The data is structured as below:
Date     Location  Action1  Quantity1  Action2    Quantity2  ...  ActionN    QuantityN
<date>   1         Lights   10         CFloor     1          ...  Null       Null
<date2>  2         CFloor   2          CWalls     4          ...  CBasement  15
<date3>  2         CWalls   7          CBasement  4          ...  Null       Null
Essentially, each action will always have a quantity attached to it (which may be 0), but null actions will never have a quantity (the quantity will just be null). The format I am trying to achieve is the following:
   Lights  CFloor  CBasement  CWalls
1  10      1       0          0
2  0       2       19         11
The index of the rows becomes the location, while the columns become any unique action found across the multiple activity columns. When pulling the data together, the value of each row/column is the sum of each quantity associated with the action (i.e. Action1 corresponds to Quantity1). Is there a way to do this with the native pandas pivot function?
My current code performs a ravel across all the activity columns to get a list of all unique activities. It will also grab all the unique locations from the Location column. Once I have the unique columns, I create an empty dataframe and fill it with zeros:
   Lights  CFloor  CBasement  CWalls
1  0       0       0          0
2  0       0       0          0
I then iterate back over the old data frame with the itertuples() method (I was told it was significantly faster than iterrows()) and populate the new dataframe. This empty dataframe acts as a template that is stored in memory and filled later.
# Creates a template from the dataframe
def create_template(df):
    act_cols = ['Activity01', 'Activity02', 'Activity03', 'Activity04']
    activities = df[act_cols]
    flat_acts = activities.values.ravel('K')
    unique_locations = pd.unique(df['Location'])
    unique_acts = pd.unique(flat_acts)
    pivot_template = pd.DataFrame(index=unique_locations, columns=unique_acts).fillna(0)
    return pivot_template
# Fills the template from the dataframe
def create_pivot(df, pivot_frmt):
    act_cols = ['Activity01', 'Activity02', 'Activity03', 'Activity04']
    quant_cols = ['Quantity01', 'Quantity02', 'Quantity03', 'Quantity04']
    for row in df.itertuples():
        for act, quantity in zip(act_cols, quant_cols):
            act_val = getattr(row, act)
            if pd.notna(act_val):
                quantity_val = getattr(row, quantity)
                location = getattr(row, 'Location')
                pivot_frmt.loc[location, act_val] += quantity_val
    return pivot_frmt
While my solution works, it is incredibly slow when dealing with a large dataset and has taken 10 seconds or more to complete this type of operation. Any help would be greatly appreciated!
After experimenting with various pandas functions, such as melt and pivot on multiple columns simultaneously, I found a solution that worked for me:
For every quantity-activity pair, I build a partial frame of the final dataset and store it in a list. Once every pair has been addressed I will end up with multiple dataframes that all have the same row counts, but potentially different column counts. I solved this issue by simply concatenating the columns and if any columns are repeated, I then sum them to get the final result.
import numpy as np
import pandas as pd

def test_pivot(df):
    act_cols = ['Activity01', 'Activity02', 'Activity03', 'Activity04']
    quant_cols = ['Quantity01', 'Quantity02', 'Quantity03', 'Quantity04']
    dfs = []
    for act, quant in zip(act_cols, quant_cols):
        partial = pd.crosstab(index=df['Location'], columns=df[act],
                              values=df[quant], aggfunc=np.sum).fillna(0)
        dfs.append(partial)
    finalDf = pd.concat(dfs, axis=1)
    # Sum any repeated activity columns across the partial frames
    finalDf = finalDf.groupby(finalDf.columns, axis=1).sum()
    return finalDf
There are two assumptions that I make during this approach:
The indexes maintain their order across all partial dataframes
There are an equivalent number of indexes across all partial dataframes
While this is probably not the most elegant solution, it achieves the desired result and reduced the processing time by a very significant margin (from ~10 s to ~0.2 s on ~4k rows). If anybody has a better way to deal with this type of scenario and do the process outlined above in one shot, then I would love to see your response!
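For anyone after a one-shot version, here is a sketch using pd.wide_to_long plus a single pivot_table, assuming the Activity01..04/Quantity01..04 layout from the code above (the sample frame is hypothetical):
import pandas as pd

# Hypothetical sample mirroring the layout described in the question
df = pd.DataFrame({
    'Location':   [1, 2, 2],
    'Activity01': ['Lights', 'CFloor', 'CWalls'],
    'Quantity01': [10, 2, 7],
    'Activity02': ['CFloor', 'CWalls', 'CBasement'],
    'Quantity02': [1, 4, 4],
    'Activity03': [None, 'CBasement', None],
    'Quantity03': [None, 15, None],
})

# Reshape every Activity/Quantity pair to long format, then pivot once
long_df = (pd.wide_to_long(df.reset_index(), stubnames=['Activity', 'Quantity'],
                           i='index', j='pair', suffix=r'\d+')
             .dropna(subset=['Activity']))
result = long_df.pivot_table(index='Location', columns='Activity',
                             values='Quantity', aggfunc='sum', fill_value=0)
print(result)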
