I have read all the answers related to my question on Stack Overflow, but my question is a little different from them. I have a very large dataframe, and a portion of it is shown below.
The input dataframe looks like this:
A B C D
0 foot 17/1: OGChan_2020011717711829281829281 , 7days ...
1 arm this will processed after ;;;
2 leg go_2020011625692400374400374 16/1: Id Imerys_2020011618188744093744093
3 head xyziemen_2020011510691787006787006 en_2020011510749462801462801 ;;;
: : : :
In this dataframe, I first extract IDs from column B using a regular expression. Some rows of column B contain such IDs, some do not, and some are blank. The code is as follows:
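import re
import pandas as pd

df = pd.read_excel("Book1.xlsx", "Sheet1")
ids_by_row = {}  # renamed from `dict` to avoid shadowing the built-in
for i in df.index:
    j = str(df['B'][i])
    a = re.findall(r'_\d{25}', j)  # run the regex once per row
    if a:
        print(a)
        ids_by_row[i] = a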
The regular expression matches an underscore (_) followed by 25 digits. Examples in the dataframe above are _2020011618188744093744093, _2020011510749462801462801, etc.
Now I want to insert these IDs into column D of the corresponding row. For example, if two IDs are found in row 0, the first ID should go into row 0 of column D and the second into row 1 of column D, with all following content of the dataframe shifted down. The output below, based on the input above, shows what I want.
A B .. D
0 foot 17/1: OGChan_2020011717711829281829281 ,7days _2020011717711829281829281
1 arm this will processed after
2 leg go_2020011625692400374400374 16/1: _2020011625692400374400374
Id Imerys_2020011618188744093744093
3 _2020011618188744093744093
4 head xyziemen_2020011510691787006787006 _2020011510691787006787006
en_2020011510749462801462801
5 _2020011510749462801462801
: : : :
In the output above, one ID is found in row 0, so column D of row 0 contains that ID. No ID is found at index 1, so column D is empty there. At index 2 there are two IDs, so the first is placed in row 2 of column D and the second in row 3, which shifts the previous content of row 3 down to row 4. This is the final output I want.
I hope this is clear. Thanks in advance.
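One possible way to get this layout, sketched under assumptions: collect the matches per row with str.findall and then use explode so that each ID gets its own row. The three-row sample frame below is a cut-down, hypothetical stand-in for the real data, and blanking columns A and B on the extra rows is a guess at the intended output.

```python
import pandas as pd

# Hypothetical sample: only columns A and B of the first three rows shown above.
df = pd.DataFrame({
    "A": ["foot", "arm", "leg"],
    "B": [
        "17/1: OGChan_2020011717711829281829281 , 7days",
        "this will processed after",
        "go_2020011625692400374400374 16/1: Id Imerys_2020011618188744093744093",
    ],
})

# One list of IDs per row; rows without a match get an empty list.
df["D"] = df["B"].astype(str).str.findall(r"_\d{25}")

# explode() gives every list element its own row; empty lists become NaN.
exploded = df.explode("D")

# Rows that repeat an original index label hold the 2nd, 3rd, ... ID of a row.
is_extra = exploded.index.duplicated()

out = exploded.reset_index(drop=True)
out.loc[is_extra, ["A", "B"]] = ""   # assumption: extra rows show only the ID
out["D"] = out["D"].fillna("")       # rows without any ID get an empty cell

print(out)
```

Rows that follow a multi-ID row are pushed down automatically, because the index is rebuilt after the explode.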
Let's say I have a large data set that follows a similar structure:
where the id repeats multiple times. I would like to select any id where the value in column b changed with the desired output as such:
How might I be able to achieve that via pandas?
It is not entirely clear what you are asking for. You say
I would like to select any id where the value in column b changed
but 'changed' from what?
Perhaps the following can be helpful -- it will show you all unique ColumnB strings for each id
Using a sample df:
df = pd.DataFrame({'id':[1,1,1,2,2,2,2], 'colb':['a','a','b','c','d','c','c']})
we use groupby and unique:
df.groupby('id')['colb'].unique().explode().to_frame()
output:
colb
id
1 a
1 b
2 c
2 d
so for id=1 we have a and b as unique phrases, and for id=2 we have c and d
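If "changed" is taken to mean "the id has more than one distinct value in colb", a minimal sketch of the selection could look like the following. That reading is an assumption, and id 3 is an extra hypothetical group added so that there is something to filter out.

```python
import pandas as pd

# Sample data from above, plus a hypothetical id 3 whose colb never changes.
df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2, 2, 3, 3],
    "colb": ["a", "a", "b", "c", "d", "c", "c", "e", "e"],
})

# Count distinct colb values per id and broadcast the count back to every row.
changed = df.groupby("id")["colb"].transform("nunique") > 1

# Keep only the rows whose id has more than one distinct colb value.
result = df[changed]
print(result)
```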
I need to identify the different groups in Excel files and the rows inside these groups (to be more accurate, I need to get the value of the first cell of the main row under which other rows are grouped).
Below is an example of the files structure (I've minimized the groups but when I receive these files they are expanded):
I know how to create new groups using openpyxl or xlwt, and I'm familiar with both openpyxl and xlrd, but I'm unable to find anything in the API to solve this requirement.
So, is it possible using Python and, if so, which part of the openpyxl or xlrd API should I use?
You should be able to do this using the worksheet's row_dimensions. This returns an object accessible like a dict where the keys are the row numbers of the sheet. outline_level will have a non-zero value for each depth of grouping, or 0 if the row is not part of a group.
So, if you had a sheet where rows 2 and 3 were a group, and rows 5 and 6 were another group, iterating through row_dimensions would look like this:
>>> for row in range(ws.min_row, ws.max_row + 1):
...     print(f"row {row} is in group {ws.row_dimensions[row].outline_level}")
...
row 1 is in group 0
row 2 is in group 1
row 3 is in group 1
row 4 is in group 0
row 5 is in group 1
row 6 is in group 1
I should point out that there's some weirdness with accessing the information. My original solution was this:
>>> for row_num, row_data in ws.row_dimensions.items():
...     print(f"row {row_num} is group {row_data.outline_level}")
...
row 2 is group 1
row 3 is group 1
row 4 is group 0
row 5 is group 1
row 6 is group 1
Notice that row 1 is missing. It wasn't part of row_dimensions until I manually accessed it as row_dimensions[1] and then it appeared. I don't know how to explain that, but the first approach is probably better as it specifically iterates from the first to last row.
The same process applies to column groups through column_dimensions, except that it must be keyed using column letter(s), e.g. ws.column_dimensions["A"].outline_level.
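To get from outline levels to the first cell of each group's main row, a sketch along these lines might work. It assumes the main row sits directly above its grouped rows and that there is only one level of grouping; the in-memory workbook and its labels are made up for illustration.

```python
from openpyxl import Workbook

# Build a small hypothetical sheet: rows 2-3 and rows 5-6 are grouped.
wb = Workbook()
ws = wb.active
for label in ["Group A", "a1", "a2", "Group B", "b1", "b2"]:
    ws.append([label])
for r in (2, 3, 5, 6):
    ws.row_dimensions[r].outline_level = 1

# Map the first cell of each grouped row to the first cell of its main row.
parents = {}
current_parent = None
for row in range(ws.min_row, ws.max_row + 1):
    level = ws.row_dimensions[row].outline_level
    first_cell = ws.cell(row=row, column=1).value
    if level == 0:
        # Assumption: an ungrouped row is the main row of the group below it.
        current_parent = first_cell
    else:
        parents[first_cell] = current_parent

print(parents)
```

If the workbook places summary rows below the detail rows instead, the loop would need to run from the last row to the first.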
I am trying to add rows where there is a gap between month_count. For example, row 0 has month_count = 0 and row 1 has month_count = 7. How can I add extra 6 rows with month counts being 1,2,3,4,5,6? Also, same situation from row 3 to row 4. I would like to add 2 extra rows with month_count 10 and 11. What is the best way to go about this?
One way to do this is to iterate over the rows and rebuild the DataFrame with the missing rows inserted. Pandas does not support inserting rows directly at a given position; however, you can hack together a solution using pd.concat():
import pandas as pd

def pandas_insert(df, idx, row_contents):
    # Split the frame at position idx and place the new rows in between.
    top = df.iloc[:idx]
    bot = df.iloc[idx:]
    inserted = pd.concat([top, row_contents, bot], ignore_index=True)
    return inserted
Here row_contents should be a DataFrame with one (or more) rows. We use ignore_index=True so that the index of the new DataFrame is relabeled 0, 1, …, n-2, n-1.
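As an alternative to inserting rows one at a time, the gaps can also be filled in a single step with reindex. This is a different technique from the concat helper above, shown as a sketch only: the original table was posted as an image, so the value column and the numbers below are hypothetical, chosen to match the gaps described in the question.

```python
import pandas as pd

# Hypothetical data: gaps between 0 and 7, and between 9 and 12.
df = pd.DataFrame({
    "month_count": [0, 7, 8, 9, 12],
    "value": [10, 20, 30, 40, 50],
})

# Every month from the smallest to the largest month_count, inclusive.
first = int(df["month_count"].min())
last = int(df["month_count"].max())
full_range = range(first, last + 1)

# Reindexing on month_count creates the missing rows; their other columns are NaN.
filled = (
    df.set_index("month_count")
      .reindex(full_range)
      .rename_axis("month_count")
      .reset_index()
)
print(filled)
```

The new rows hold NaN in every other column; they could be filled afterwards, for example with ffill(), depending on what the missing months should contain.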
I have data as shown below. I would like to select rows based on two conditions.
1) rows that start with digits (1,2,3 etc)
2) previous row of the records that satisfy 1st condition
This is what the input data looks like:
This is the output I expect:
I tried using the shift(-1) function, but it throws an error. I am sure I have messed up the logic or the syntax. The code I tried is below:
# Get the index of all records that start with a number.
s = df1.loc[df1['VARIABLE'].str.contains(r'^\d') == True].index
# Now I need the previous record of each group, but this is incorrect.
df1.loc[((df1['VARIABLE'].shift(-1).str.contains(r'^\d') == False) &
         (df1['VARIABLE'].str.contains(r'^\d') == True))].index
Use:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'VARIABLE': ['studyid', np.nan, 'age_interview', 'Gender', '1.Male',
                                 '2.Female', np.nan, 'dob', 'eth',
                                 'Ethnicity', '1.Chinese', '2.Indian', '3.Malay']})
# first remove rows with missing values in column VARIABLE
df1 = df1.dropna(subset=['VARIABLE'])
# test for values starting with a digit
s = df1['VARIABLE'].str.contains(r'^\d')
# chain with the shifted values by | for OR; fill_value keeps the mask boolean
mask = s | s.shift(-1, fill_value=False)
# filter by boolean indexing
df1 = df1[mask]
print(df1)
     VARIABLE
3      Gender
4      1.Male
5    2.Female
9   Ethnicity
10  1.Chinese
11   2.Indian
12    3.Malay
In my main df, I have a column that is built from two other columns, creating values that look like this: A1_43567_1. The first number represents the type of assessment taken, the second number is a question ID, and the final number is the question's position on the assessment. I plan on creating a pivot table with each unique value as a column, so I can look across multiple students' selections for each item. But I want the columns of the pivot to be ordered by Question Position, i.e. the third value in the concatenation. Essentially this output:
Student ID  A1_45678_1  A1_34551_2  A1_11134_3  etc....
12345       1           0           0
12346       0           0           1
12343       1           1           0
I've tried sorting my data frame by the original column I want it to be sorted by (Question Position) and then creating the pivot table, but that doesn't render the above result I'm looking for. Is there a way to sort the original concatenation values by the third value in the column? Or is it possible to sort a pivot table by the third value in each column?
Current code is:
demo_pivot = demo_pivot.sort_values('Question Position', ascending=True)
demo_pivot['newcol'] = ('A' + str(interim_selection) + '_' +
                        demo_pivot['Item ID'].map(str) + '_' +
                        demo_pivot['Question Position'].map(str))
demo_pivot = pd.pivot_table(demo_pivot, index='Student ANET ID', values='Points Received',
                            columns='newcol').reset_index()
But generates this output:
Student ID  A1_45678_1  A1_34871_7  A1_11134_15  etc....
12345       1           0           0
12346       0           0           1
12343       1           1           0
The call to pd.pivot_table() returns a DataFrame, correct? If so, can you just reorder the columns of the resulting DataFrame? Something like:
def sort_columns(column_list):
    # Create a list of tuples: (question position, column name)
    sort_list = [(int(col.split('_')[2]), col) for col in column_list]
    # Sorts by the first item in each tuple, which is the question position
    sort_list.sort()
    # Return the column names in the sorted order:
    return [x[1] for x in sort_list]

# Now, you should be able to reorder the DataFrame like so:
demo_pivot = demo_pivot.loc[:, sort_columns(demo_pivot.columns)]
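One caveat: after reset_index() the pivot also has a 'Student ANET ID' column, which does not split into three parts, so the function above would likely raise an IndexError on it. Below is a self-contained sketch of a variant that keeps such identifier columns in front. The id_cols parameter and the sample data are hypothetical, and the question position is assumed to always be the part after the last underscore.

```python
import pandas as pd

def sort_columns(column_list, id_cols=("Student ANET ID",)):
    # Identifier columns stay in front, in their existing order.
    keep_first = [c for c in column_list if c in id_cols]
    # Everything else is assumed to end in "_<question position>".
    questions = [c for c in column_list if c not in id_cols]
    questions.sort(key=lambda col: int(col.rsplit("_", 1)[1]))
    return keep_first + questions

# Hypothetical long-format data for two students and three questions.
long_df = pd.DataFrame({
    "Student ANET ID": [12345, 12345, 12345, 12346, 12346, 12346],
    "newcol": ["A1_45678_1", "A1_34871_7", "A1_11134_15"] * 2,
    "Points Received": [1, 0, 0, 0, 0, 1],
})

pivot = pd.pivot_table(long_df, index="Student ANET ID",
                       values="Points Received", columns="newcol").reset_index()

# pivot_table orders the columns alphabetically; reorder by question position.
pivot = pivot.loc[:, sort_columns(pivot.columns)]
print(pivot)
```

Sorting with int() matters here: sorting the names as strings would compare the question IDs first and ignore the positions.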