How to shift selected rows to the next adjacent column in pandas?

df3 = pd.read_excel(r'may_2019.xlsx', sheet_name='Sheet2')
Here is a sample of my pandas DataFrame:
+--------------------------+
| Col1 |
+--------------------------+
| G | 20 mins | 2015 |
| NR | 2 |
| G | 11 mins | 302 |
| TV-MA | 44 mins | Apr 30 |
| G | 198 |
| TV-MA | Apr 30 |
| NR | 2012 |
| NR | 57 mins |
+--------------------------+
There are some exceptions in the data (i.e. 2, 198, 302).
Desired output for the given sample:
+--------+----------+------+-------+-----+
| Rating | Duration | Year | Month | Day |
+--------+----------+------+-------+-----+
| G | 20 | 2015 | | |
| NR | | 2 | | |
| G | 11 | 302 | | |
| TV-MA | 44 | | Apr | 30 |
| G | | 198 | | |
| TV-MA | | | Apr | 30 |
| NR | | 2012 | | |
| NR | 57 | | | |
+--------+----------+------+-------+-----+
Things I've tried
df5 = pd.DataFrame(df3.Col1.str.split("|").tolist(), columns=['r', 'd', 'y'])
indx = df5.loc[df5.d.str.contains(r'\d{4}')].index
df5.loc[indx, ['d', 'y']] = df5.loc[indx, ['d', 'y']].shift(1, axis=1)
Then I could not shift the dates into the columns my required table needs, so I tried to write a function, but that did not work either:
import re

def split_data(row):
    newd = row.split("|")
    if len(newd) == 3:
        df['date'] = newd[2]
        df['du'] = newd[1]
        df['rating'] = newd[0]
    if len(newd) == 2:
        df['rating'] = newd[0]
        if re.findall(r'\d{4}', newd[1]):
            df['date'] = newd[1]
        else:
            df['du'] = newd[1]
    return df
Neither attempt provides a complete solution for all the cases. Does anyone know how to do this with pandas?

Looking at your input, I would first try reading the data in properly; it seems the separators in the Excel file are not being handled correctly.
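If fixing the read does not help, the tokens can be classified by pattern after splitting. A minimal sketch, assuming the patterns visible in the sample; the regexes, helper name, and output columns are assumptions, not code from the question:
import re
import pandas as pd

def classify(cell):
    # split a Col1 cell on '|' and route each token by its pattern (assumed rules)
    out = {'Rating': None, 'Duration': None, 'Year': None, 'Month': None, 'Day': None}
    tokens = [t.strip() for t in str(cell).split('|')]
    out['Rating'] = tokens[0]  # the first token is always the rating
    for tok in tokens[1:]:
        if tok.endswith('mins'):                         # e.g. "20 mins"
            out['Duration'] = tok.split()[0]
        elif re.fullmatch(r'[A-Za-z]{3} \d{1,2}', tok):  # e.g. "Apr 30"
            out['Month'], out['Day'] = tok.split()
        else:                                            # bare number: a year or an exception (2, 198, 302)
            out['Year'] = tok
    return pd.Series(out)

result = df3['Col1'].apply(classify)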

Related

Dividing a dataframe into several dataframes according to date column

I have a dataframe that contains a date column called 'testdate', and I have a period between two specific dates, such as 20110501~20120731.
I want to divide that dataframe into multiple dataframes according to the year-month of 'testdate'.
For example, if 'testdate' falls within 20110501-20110531 the row goes to df1; if 'testdate' falls within the next month, df2; and so on.
For example, a whole dataframe looks like this...
| StudentID | Testdate | Record |
| -------- | -------- | ------ |
| 1 | 20110528 | 50 |
| 2 | 20110601 | 75 |
| 3 | 20110504 | 100 |
| 4 | 20110719 | 82 |
| 5 | 20111120 | 42 |
| 6 | 20111103 | 95 |
| 7 | 20120520 | 42 |
| 8 | 20120503 | 95 |
But, I want to divide it like this...
[DF1]: name should be 201105
| StudentID | Testdate | Record |
| -------- | -------- | ------ |
| 1 | 20110528 | 50 |
| 3 | 20110504 | 100 |
[DF2]: name should be 201106
| StudentID | Testdate | Record |
| -------- | -------- | ------ |
| 2 | 20110601 | 75 |
[DF3]
| StudentID | Testdate | Record |
| -------- | -------- | ------ |
| 4 | 20110719 | 82 |
[DF4]
| StudentID | Testdate | Record |
| -------- | -------- | ------ |
| 5 | 20111120 | 42 |
| 6 | 20111103 | 95 |
[DF5]
| StudentID | Testdate | Record |
| -------- | -------- | ------ |
| 7 | 20120520 | 42 |
| 8 | 20120503 | 95 |
I found some code for dividing a dataframe by quarter, but I could not find any for my task.
How can I deal with this? Many thanks for your help.
Create a grouper by slicing yyyymm from Testdate, then group the dataframe and store each group in a dict comprehension:
s = df['Testdate'].astype(str).str[:6]
dfs = {f'df_{k}': g for k, g in df.groupby(s)}
# dfs['df_201105']
StudentID Testdate Record
0 1 20110528 50
2 3 20110504 100
# dfs['df_201106']
StudentID Testdate Record
1 2 20110601 75
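If actual files named by year-month are needed (the question says the name should be 201105), the dict can be written out; the CSV filenames below are an assumption:
# assumed follow-up: write each monthly group to its own file
for key, group in dfs.items():
    yyyymm = key.split('_')[1]  # e.g. '201105'
    group.to_csv(f'{yyyymm}.csv', index=False)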

Plot multiple bar plots for multiple columns

I have a dataset that looks roughly like the table below.
I need to create a barplot for each column TS1 to TS5 that counts the occurrences of each item in that column. The items are one of the following: NOT_SEEN, NOT_ABLE, HIGH_BAR, and numerical values between 110 and 140 in steps of 2 (so 110, 112, 114, etc.).
I have found a way to do this that works fine, but I am asking whether there is a way to write a loop so I don't have to copy-paste the same code five times (for the five columns).
This is what I have tried, and it works:
from pandas.api.types import CategoricalDtype

num_range = list(range(110, 140, 2))
OUTCOMES = ['NOT_SEEN', 'NOT_ABLE', 'HIGH_BAR']
OUTCOMES.extend([str(num) for num in num_range])
OUTCOMES = CategoricalDtype(OUTCOMES, ordered=True)
fig, ax =plt.subplots(2, 3, sharey=True)
fig.tight_layout(pad=3)
The block below is what I copy five times, changing only the title (Testing 1, Testing 2, etc.) and the column name TS1, TS2, ... in the first line.
df["outcomes"] = df["TS1"].astype(OUTCOMES)
bpt=sns.countplot(x= "outcomes", data=df, palette='GnBu', ax=ax[0,0])
plt.setp(bpt.get_xticklabels(), rotation=60, size=6, ha='right')
bpt.set(xlabel='')
bpt.set_title('Testing 1')
The following code then comes after the five copies of that block.
ax[1,2].set_visible(False)
plt.show()
I am sure there is a much better way to do this, but I'm new to all of this.
Also, I need to make sure the bars of each barplot are ordered left to right as NOT_SEEN, NOT_ABLE, HIGH_BAR, then 110, 112, 114, etc.
I am using Python 2.7 (not my choice, unfortunately) and pandas 0.24.2.
+----+------+------+----------+----------+----------+----------+----------+
| ID | VIEW | YEAR | TS1 | TS2 | TS3 | TS4 | TS5 |
+----+------+------+----------+----------+----------+----------+----------+
| AA | NO | 2005 | | 134 | | HIGH_BAR | |
+----+------+------+----------+----------+----------+----------+----------+
| AB | YES | 2015 | | | NOT_SEEN | | |
+----+------+------+----------+----------+----------+----------+----------+
| AB | YES | 2010 | 118 | | | | NOT_ABLE |
+----+------+------+----------+----------+----------+----------+----------+
| BB | NO | 2020 | | | | | |
+----+------+------+----------+----------+----------+----------+----------+
| BA | YES | 2020 | | | | NOT_SEEN | |
+----+------+------+----------+----------+----------+----------+----------+
| AA | NO | 2010 | | | | | |
+----+------+------+----------+----------+----------+----------+----------+
| BA | NO | 2015 | | | | | 132 |
+----+------+------+----------+----------+----------+----------+----------+
| BB | YES | 2010 | | HIGH_BAR | | 140 | NOT_ABLE |
+----+------+------+----------+----------+----------+----------+----------+
| AA | YES | 2020 | | | | | |
+----+------+------+----------+----------+----------+----------+----------+
| AB | NO | 2010 | | | | 112 | |
+----+------+------+----------+----------+----------+----------+----------+
| AB | YES | 2015 | | | NOT_ABLE | | HIGH_BAR |
+----+------+------+----------+----------+----------+----------+----------+
| BB | NO | 2020 | | | | 145 | |
+----+------+------+----------+----------+----------+----------+----------+
| BA | NO | 2015 | | 110 | | | |
+----+------+------+----------+----------+----------+----------+----------+
| AA | YES | 2010 | HIGH_BAR | | | NOT_SEEN | |
+----+------+------+----------+----------+----------+----------+----------+
| BA | YES | 2015 | | | | | |
+----+------+------+----------+----------+----------+----------+----------+
| AA | NO | 2020 | | | | 118 | |
+----+------+------+----------+----------+----------+----------+----------+
| BA | YES | 2015 | | 180 | NOT_ABLE | | |
+----+------+------+----------+----------+----------+----------+----------+
| BB | YES | 2020 | | NOT_SEEN | | | 126 |
+----+------+------+----------+----------+----------+----------+----------+
You can put the plotting lines in a function and call it in a for loop, automatically changing the column, title, and axis on each iteration:
fig, axes = plt.subplots(2, 3, sharey=True)
fig.tight_layout(pad=3)

def plotting(column, title, ax):
    df["outcomes"] = df[column].astype(OUTCOMES)
    bpt = sns.countplot(x="outcomes", data=df, palette='GnBu', ax=ax)
    plt.setp(bpt.get_xticklabels(), rotation=60, size=6, ha='right')
    bpt.set(xlabel='')
    bpt.set_title(title)

columns = ['TS1', 'TS2', 'TS3', 'TS4', 'TS5']
titles = ['Testing 1', 'Testing 2', 'Testing 3', 'Testing 4', 'Testing 5']

for column, title, ax in zip(columns, titles, axes.flatten()):
    plotting(column, title, ax)

axes[1, 2].set_visible(False)
plt.show()

How to check if next 3 consecutive rows in pandas column have same value?

I have a pandas dataframe with 3 columns - id, date and value.
| id | date | value |
| --- | --- | --- |
| 1001 | 1-04-2021 | 61 |
| 1001 | 3-04-2021 | 61 |
| 1001 | 10-04-2021 | 61 |
| 1002 | 11-04-2021 | 13 |
| 1002 | 12-04-2021 | 12 |
| 1015 | 18-04-2021 | 42 |
| 1015 | 20-04-2021 | 42 |
| 1015 | 21-04-2021 | 43 |
| 2001 | 8-04-2021 | 27 |
| 2001 | 11-04-2021 | 27 |
| 2001 | 12-04-2021 | 27 |
| 2001 | 27-04-2021 | 27 |
| 2001 | 29-04-2021 | 27 |
For each id, I want to check which rows belong to a run of 3 or more consecutive rows with the same value in the value column. Once such a run is identified, flag its rows as 1 in a separate column, else 0.
So the final dataframe would look like the following,
| id | date | value | pattern
| --- | --- | --- | --- |
| 1001 | 1-04-2021 | 61 | 1 |
| 1001 | 3-04-2021 | 61 | 1 |
| 1001 | 10-04-2021 | 61 | 1 |
| 1002 | 11-04-2021 | 13 | 0 |
| 1002 | 12-04-2021 | 12 | 0 |
| 1015 | 18-04-2021 | 42 | 0 |
| 1015 | 20-04-2021 | 42 | 0 |
| 1015 | 21-04-2021 | 43 | 0 |
| 2001 | 8-04-2021 | 27 | 1 |
| 2001 | 11-04-2021 | 27 | 1 |
| 2001 | 12-04-2021 | 27 | 1 |
| 2001 | 27-04-2021 | 27 | 1 |
| 2001 | 29-04-2021 | 27 | 1 |
Try with groupby:
df['pattern'] = (df.groupby(['id', df['value'].diff().ne(0).cumsum()])
                   ['id'].transform('size').ge(3).astype(int))
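To see why this works: the inner expression builds a run id that increments whenever value changes, grouping by ['id', run] keeps runs from spilling across ids, and transform('size') broadcasts each run's length back onto its rows. A sketch of the intermediate series for the sample data above:
runs = df['value'].diff().ne(0).cumsum()
# runs: 1 1 1 2 3 4 4 5 6 6 6 6 6
# rows 0-2 and 8-12 form runs of size >= 3, so they get pattern = 1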
How about this:
def f(x):
    # x holds the diffs of `value`; a 0 means "same value as the previous row"
    # (note: this assumes `value` changes at id boundaries; to be safe,
    # compute the diffs per group with df.groupby('id')['value'].diff())
    x = x.fillna(0)
    y = len(x) * [0]
    for i in range(len(x) - 2):
        # two consecutive zero diffs => three equal values in a row
        if x[i + 1] == 0 and x[i + 2] == 0:
            y[i] = 1
            y[i + 1] = 1
            y[i + 2] = 1
    return pd.Series(y)

df['pattern'] = df['value'].diff().transform(f)

How to select all rows from a Dask dataframe with value equal to minimal value of group

I have the following Dask dataframe, grouped by the Problem column.
| Problem | Items | Min_Dimension | Max_Dimension | Cost |
|-------- |------ |---------------|-------------- |------ |
| A | 7 | 2 | 15 | 23 |
| A | 5 | 2 | 15 | 38 |
| A | 15 | 2 | 15 | 23 |
| B | 11 | 6 | 10 | 54 |
| B | 10 | 6 | 10 | 48 |
| B | 18 | 6 | 10 | 79 |
| C | 50 | 8 | 25 | 120 |
| C | 50 | 8 | 25 | 68 |
| C | 48 | 8 | 25 | 68 |
| ... | ... | ... | ... | ... |
The goal is to create a new dataframe with all rows where the Cost value is minimal for that particular Problem group, so we want the following result:
| Problem | Items | Min_Dimension | Max_Dimension | Cost |
|-------- |------ |---------------|-------------- |------ |
| A | 7 | 2 | 15 | 23 |
| A | 15 | 2 | 15 | 23 |
| B | 10 | 6 | 10 | 48 |
| C | 50 | 8 | 25 | 68 |
| C | 48 | 8 | 25 | 68 |
| ... | ... | ... | ... | ... |
How can I achieve this result? I already tried using idxmin() as mentioned in another question on here, but then I get: ValueError: Not all divisions are known, can't align partitions. Please use set_index to set the index.
What if you create another dataframe that is grouped by Problem and holds Cost.min()? Let's say the new column is called cost_min.
df1 = df.groupby('Problem')['Cost'].min().reset_index().rename(columns={'Cost': 'cost_min'})
Then merge this new cost_min column back into the dataframe.
df2 = df.merge(df1, how='left', on='Problem')
From there, do something like:
df_new = df2.loc[df2['Cost'] == df2['cost_min']]
Just wrote some pseudocode, but I think this all works with Dask.
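A Dask-flavored sketch of the same idea (untested; it assumes ddf is the question's Dask dataframe and that merging on a column, rather than the index, avoids the division-alignment error):
import dask.dataframe as dd

# assumes `ddf` is the question's Dask dataframe, e.g. ddf = dd.read_csv('problems.csv')
# per-group minimum cost, renamed so the merge does not produce Cost_x/Cost_y
mins = ddf.groupby('Problem')['Cost'].min().reset_index()
mins = mins.rename(columns={'Cost': 'cost_min'})

merged = ddf.merge(mins, how='left', on='Problem')
result = merged[merged['Cost'] == merged['cost_min']]
print(result.compute())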

Multi-Index Lookup Mapping

I'm trying to create a new column whose value is based on two index levels of that row. I have two dataframes with an equivalent multi-index on the levels I'm querying (but they are not of equal size). For each row in the first dataframe, I want the value from the second dataframe that matches the row's indices.
I originally thought I could use .loc[] and filter off the index values, but I cannot seem to get this to change the output row by row. If I weren't using a DataFrame object, I'd just loop over the whole thing.
I have also tried the .apply() method, but I can't figure out what function to pass to it.
Creating some toy data with the same structure:
import pandas as pd
import numpy as np

np.random.seed(1)  # np.random.seed is a function; `np.random.seed = 1` would just overwrite it

df = pd.DataFrame({'Aircraft': np.ones(15),
                   'DC': np.append(np.repeat(['A', 'B'], 7), 'C'),
                   'Test': np.array([10, 10, 10, 10, 10, 10, 20, 10, 10, 10, 10, 10, 10, 20, 10]),
                   'Record': np.array([1, 2, 3, 4, 5, 6, 1, 1, 2, 3, 4, 5, 6, 1, 1]),
                   # There are multiple "value" columns in my data, but I have simplified here
                   'Value': np.random.random(15)})
df.set_index(['Aircraft', 'DC', 'Test', 'Record'], inplace=True)
df.sort_index(inplace=True)

v = pd.DataFrame({'Aircraft': np.ones(7),
                  'DC': np.repeat('v', 7),
                  'Test': np.array([10, 10, 10, 10, 10, 10, 20]),
                  'Record': np.array([1, 2, 3, 4, 5, 6, 1]),
                  'Value': np.random.random(7)})
v.set_index(['Aircraft', 'DC', 'Test', 'Record'], inplace=True)
v.sort_index(inplace=True)
df['v'] = df.apply(lambda x: v.loc[df.iloc[x]])
This returns an error for indexing on a multi-index.
To set all values to a single "v" value:
df['v'] = float(v.loc[(slice(None), 'v', 10, 1), 'Value'])
So inputs look like this:
--------------------------------------------
| Aircraft | DC | Test | Record | Value |
|----------|----|------|--------|----------|
| 1.0 | A | 10 | 1 | 0.847576 |
| | | | 2 | 0.860720 |
| | | | 3 | 0.017704 |
| | | | 4 | 0.082040 |
| | | | 5 | 0.583630 |
| | | | 6 | 0.506363 |
| | | 20 | 1 | 0.844716 |
| | B | 10 | 1 | 0.698131 |
| | | | 2 | 0.112444 |
| | | | 3 | 0.718316 |
| | | | 4 | 0.797613 |
| | | | 5 | 0.129207 |
| | | | 6 | 0.861329 |
| | | 20 | 1 | 0.535628 |
| | C | 10 | 1 | 0.121704 |
--------------------------------------------
--------------------------------------------
| Aircraft | DC | Test | Record | Value |
|----------|----|------|--------|----------|
| 1.0 | v | 10 | 1 | 0.961791 |
| | | | 2 | 0.046681 |
| | | | 3 | 0.913453 |
| | | | 4 | 0.495924 |
| | | | 5 | 0.149950 |
| | | | 6 | 0.708635 |
| | | 20 | 1 | 0.874841 |
--------------------------------------------
And after the operation, I want this:
| Aircraft | DC | Test | Record | Value | v |
|----------|----|------|--------|----------|----------|
| 1.0 | A | 10 | 1 | 0.847576 | 0.961791 |
| | | | 2 | 0.860720 | 0.046681 |
| | | | 3 | 0.017704 | 0.913453 |
| | | | 4 | 0.082040 | 0.495924 |
| | | | 5 | 0.583630 | 0.149950 |
| | | | 6 | 0.506363 | 0.708635 |
| | | 20 | 1 | 0.844716 | 0.874841 |
| | B | 10 | 1 | 0.698131 | 0.961791 |
| | | | 2 | 0.112444 | 0.046681 |
| | | | 3 | 0.718316 | 0.913453 |
| | | | 4 | 0.797613 | 0.495924 |
| | | | 5 | 0.129207 | 0.149950 |
| | | | 6 | 0.861329 | 0.708635 |
| | | 20 | 1 | 0.535628 | 0.874841 |
| | C | 10 | 1 | 0.121704 | 0.961791 |
Edit:
Since you are on pandas 0.23.4, just change droplevel to reset_index with drop=True:
df_result = (df.reset_index('DC').assign(v=v.reset_index('DC', drop=True))
.set_index('DC', append=True)
.reorder_levels(v.index.names))
Original:
One way is to move the DC index level of df into the columns, use assign to create the new column, then set_index with append=True and reorder_levels:
df_result = (df.reset_index('DC').assign(v=v.droplevel('DC'))
.set_index('DC', append=True)
.reorder_levels(v.index.names))
Out[1588]:
Value v
Aircraft DC Test Record
1.0 A 10 1 0.847576 0.961791
2 0.860720 0.046681
3 0.017704 0.913453
4 0.082040 0.495924
5 0.583630 0.149950
6 0.506363 0.708635
20 1 0.844716 0.874841
B 10 1 0.698131 0.961791
2 0.112444 0.046681
3 0.718316 0.913453
4 0.797613 0.495924
5 0.129207 0.149950
6 0.861329 0.708635
20 1 0.535628 0.874841
C 10 1 0.121704 0.961791
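An equivalent lookup (a sketch, not from the answer; the variable names are assumptions) is to align v's values onto df's index with the DC level dropped:
lookup = v['Value'].reset_index('DC', drop=True)           # index: (Aircraft, Test, Record)
df['v'] = lookup.reindex(df.index.droplevel('DC')).values  # broadcast to every DC in df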
