I have an index whose values repeat in a fixed sequence.
I want to take the first value of my index column 'ID', and whenever that value repeats, move the following set of values into the next column. Here my index values repeat 4 times, so the output will have 4 columns labeled i = 1, 2, 3, 4, ..., up to N for N repeated index sets.
The starting value of the index will differ for other data sets, but the values will repeat in the same sequence.
Sample Dataset: df
ID,1
7,0.060896109
10,0.384263675
27,0.780060081
43,0.583200572
57,0.139564176
73,0.595220898
91,0.828783841
7,0.39920022
10,0.157306146
27,0.29750421
43,0.742234942
57,0.971849921
73,0.346905033
91,0.996723279
7,0.192197827
10,0.922323942
27,0.033304593
43,0.462253505
57,0.282632609
73,0.553047118
91,0.07678817
7,0.428707324
10,0.250935035
27,0.529861617
43,0.982468147
57,0.473807591
73,0.340980584
91,0.436675534
Expected Output Sample:
ID,1,2,3,4
7,0.060896109,0.39920022,0.192197827,0.428707324
10,0.384263675,0.157306146,0.922323942,0.250935035
27,0.780060081,0.29750421,0.033304593,0.529861617
43,0.583200572,0.742234942,0.462253505,0.982468147
57,0.139564176,0.971849921,0.282632609,0.473807591
73,0.595220898,0.346905033,0.553047118,0.340980584
91,0.828783841,0.996723279,0.07678817,0.436675534
Use DataFrame.pivot with a helper column created by comparing the index to its first value and taking the cumulative sum:
import numpy as np  # np.cumsum builds the helper grouping column

df = df.assign(g=np.cumsum(df.index == df.index[0])).pivot(columns='g', values='1')
print(df)
g 1 2 3 4
ID
7 0.060896 0.399200 0.192198 0.428707
10 0.384264 0.157306 0.922324 0.250935
27 0.780060 0.297504 0.033305 0.529862
43 0.583201 0.742235 0.462254 0.982468
57 0.139564 0.971850 0.282633 0.473808
73 0.595221 0.346905 0.553047 0.340981
91 0.828784 0.996723 0.076788 0.436676
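If it helps to reproduce this end to end, here is a minimal, self-contained sketch of the same idea, assuming the sample is loaded with 'ID' as the index column (the CSV text below is abbreviated to two repeats):

import numpy as np
import pandas as pd
from io import StringIO

# abbreviated sample: three IDs, each repeated twice
csv_text = "ID,1\n7,0.06\n10,0.38\n27,0.78\n7,0.40\n10,0.16\n27,0.30\n"
df = pd.read_csv(StringIO(csv_text), index_col='ID')

# each time the first ID reappears, the cumulative sum starts a new column group
out = df.assign(g=np.cumsum(df.index == df.index[0])).pivot(columns='g', values='1')
print(out)   # one row per unique ID, columns g = 1, 2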
You can use pd.pivot_table with a little extra work on the columns:
pd.pivot_table(data=df, index='ID', columns=df.groupby('ID').cumcount(), values='1')
0 1 2 3
ID
7 0.060896 0.399200 0.192198 0.428707
10 0.384264 0.157306 0.922324 0.250935
27 0.780060 0.297504 0.033305 0.529862
43 0.583201 0.742235 0.462254 0.982468
57 0.139564 0.971850 0.282633 0.473808
73 0.595221 0.346905 0.553047 0.340981
91 0.828784 0.996723 0.076788 0.436676
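As for the "little extra work on the columns": one hedged option (my own tweak, assuming 'ID' is a regular column; call reset_index() first if it is the index) is to start the cumcount at 1 so the headers come out as 1, 2, 3, 4:

out = pd.pivot_table(data=df, index='ID',
                     columns=df.groupby('ID').cumcount() + 1,
                     values='1')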
I am new to Python, I am having a hard time with this issue, and I need your help.
Q1 Q2 Q3 Q4 Q5
25 9 57 23 7
61 41 29 5 57
54 34 58 10 7
13 13 63 26 45
31 71 40 40 40
24 38 63 63 47
31 50 43 2 61
68 33 13 9 63
28 1 30 39 71
I have an Excel report with the data above. I'd like to write code that looks through all columns in the 1st row and outputs the index number of the column with 'S' in its name (i.e., 3). I want to use that index number to extract the data for that column. I do not want to use row and cell references, as the Excel file gets updated regularly, so the column will always move.
import xlrd

def find_idx():
    wb = xlrd.open_workbook(filename='data.xlsx')  # open report
    report_sheet1 = wb.sheet_by_name('Sheet 1')
    for j in range(report_sheet1.ncols):
        j = report_sheet1.cell_value(0, j)
        if 'YTD' in j:
            break
    return j.index('Q4')

find_idx()
Then I get a "substring not found" error.
What I want is to return the column index number (i.e., 3), so that I can use it easily in other code. How can I fix this?
Hass!
As far as I understood, you want to get the index of a column of an Excel file whose name contains a given substring such as Y. Is that right?
If so, here's a working snippet that does not require pandas:
import xlrd

def find_idx(excel_filename, sheet_name, col_name_lookup):
    """
    Returns the column index of the first column whose
    name contains the string col_name_lookup. If
    col_name_lookup is not found, it returns -1.
    """
    wb = xlrd.open_workbook(filename=excel_filename)
    report_sheet1 = wb.sheet_by_name(sheet_name)
    for col_ix in range(report_sheet1.ncols):
        col_name = report_sheet1.cell_value(0, col_ix)
        if col_name_lookup in col_name:
            return col_ix
    return -1

if __name__ == "__main__":
    excel_filename = "./data.xlsx"
    sheet_name = "Sheet 1"
    col_name_lookup = "S"
    print(find_idx(excel_filename, sheet_name, col_name_lookup))
I tried to give more semantic names to your variables: I split your variable j into col_ix (the actual column index in the loop) and col_name (which holds the column name).
This code assumes that the first line of your Excel file contains the column names; if the desired substring is not found in any of them, it returns -1.
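As a hedged follow-up (not part of the snippet above), once find_idx() returns the column index you can pull that column's data with xlrd's col_values(), skipping the header row:

col_ix = find_idx(excel_filename, sheet_name, col_name_lookup)
if col_ix != -1:
    wb = xlrd.open_workbook(filename=excel_filename)
    sheet = wb.sheet_by_name(sheet_name)
    column_data = sheet.col_values(col_ix, start_rowx=1)  # values below the header row
    print(column_data)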
I am creating a new dataframe which should contain only the middle value (not the median!) of every n rows; however, my code doesn't work!
I've tried several approaches with pandas and plain Python, but I always fail.
value date index
14 40 1983-07-15 14
15 86 1983-07-16 15
16 12 1983-07-17 16
17 78 1983-07-18 17
18 69 1983-07-19 18
19 78 1983-07-20 19
20 45 1983-07-21 20
21 47 1983-07-22 21
22 48 1983-07-23 22
23 ..... ......... ..
RSDF5 = RSDF4.groupby(pd.Grouper(freq='15D', key='DATE')).[int(len(RSDF5)//2)].reset_index()
I know that the code is wrong and I am completely out of ideas!
SyntaxError: invalid syntax
A solution based on indexes.
df is your original dataframe, N is the number of rows you want to group (assumed to be an odd number, so there is a unique middle row).
df2 = df.groupby(np.arange(len(df))//N).apply(lambda x : x.iloc[len(x)//2])
Be aware that if the total number of rows is not divisible by N, the last group is shorter (you still get its middle value, though).
If N is an even number, you get the central row closer to the end of the group: for example, if N=6, you get the 4th row of each group of 6 rows.
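For reference, here is a small sanity check of the snippet above on a toy frame (the column names and N = 3 are my own assumptions for the demo):

import numpy as np
import pandas as pd

N = 3
df = pd.DataFrame({'value': [40, 86, 12, 78, 69, 78, 45],
                   'date': pd.date_range('1983-07-15', periods=7)})

df2 = df.groupby(np.arange(len(df)) // N).apply(lambda x: x.iloc[len(x) // 2])
print(df2)   # rows 1 and 4 (middle of each full group), plus row 6 from the short last group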
I am working with a pandas data frame for one of my projects.
I have a column named Count containing integer values.
I have 720 values, one for each hour, i.e. 24 * 30 days.
I want to run a loop which initially gets the first 24 values from the data frame and puts them in a new column, then takes the next 24 and puts them in the next column, and so on.
for example:
input:
34
45
76
87
98
34
output:
34 87
45 98
76 34
Here there is a column of 6 values, and I am taking the first 3 values and putting them in the first column and the next 3 in the second one.
Can someone please help with writing code for this? It would be of great help.
Thanks!
You can also try numpy's reshape method performed on pd.Series.values.
s = pd.Series(np.arange(720))
df = pd.DataFrame(s.values.reshape((30, 24)).T)  # 24 rows (hours) x 30 columns (days)
Or use np.split (specify how many arrays you want to split into):
df = pd.DataFrame({"day" + str(i): v for i, v in enumerate(np.split(s.values, 30))})
This is a snippet of my Data Frame in pandas
SubJob DetectorName CategoryID DefectID Image
0 0 NECK:1 79 5
1 0 NECK:2 79 6
2 0 NECK:3 92 4
3 0 NECK:4 99 123
4 0 NECK:5 99 124
5 1 NECK:6 79 47
6 1 NECK:7 91 631
7 1 NECK:8 98 646
8 1 NECK:9 99 7
9 2 NECK:10 79 15
10 2 NECK:11 89 1023
11 2 NECK:12 79 1040
12 2 NECK:13 79 2458
13 3 NECK:14 73 2459
14 3 NECK:15 87 2517
15 3 NECK:15 79 3117
16 3 NECK:16 79 3118
till n, which is very large.
We have multiple SubJobs, which are sorted; inside each SubJob we have multiple CategoryIDs, which are sorted; and inside each CategoryID we have multiple DefectIDs, which are also sorted.
I have a separate nested list
[[CategoryId, DefectId, Image-Link] [CategoryId, DefectId, Image-Link] ...m times]
m is large
Here CategoryID and DefectID are integer values and the image link is a string.
Now I repeatedly pick a CategoryID, DefectID pair from the list, find the row in the dataframe corresponding to that pair, and add the image link to that row.
My current code is:
for image_info_list in final_image_info_list:
    # add path of image in Image_Link
    frame_main.ix[(frame_main["CategoryID"].values == image_info_list[0])
                  &
                  (frame_main["DefectID"].values == image_info_list[1]),
                  "Image_Link"] = image_info_list[2]
This is working perfectly, but since n and m are very large it takes a lot of time to compute. Is there another, more appropriate approach?
Can I apply binary search here? If yes, then how?
For a fixed n, if m is large enough, you can perform queries more efficiently by some preprocessing.
(I would start with Idea 2 below, because Idea 1 is much more work to implement.)
Idea 1
First, sort the dataframe by [CategoryId, DefectId, Image-Link]. Following that, you can find any triplet by a triple application of a bisect algorithm, one per column, on the column's values.
The cost of what you're doing now is O(m n). The cost of my suggestion is O(n log(n) + m log(n)).
This will work better for some values of m and n, and worse for others. E.g., if m = Θ(n), then your current algorithm is Θ(n²) = ω(n log(n)). YMMV.
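If it helps, here is a rough sketch of this idea using the standard library's bisect on a pre-sorted list of key tuples; it assumes each (CategoryID, DefectID) pair matches a single row and reuses the names from your code:

from bisect import bisect_left

# sort (CategoryID, DefectID, row position) once, then binary-search each query
keys = sorted(
    (cat, dfct, pos)
    for pos, (cat, dfct) in enumerate(zip(frame_main["CategoryID"], frame_main["DefectID"]))
)
key_pairs = [(cat, dfct) for cat, dfct, _ in keys]

for category_id, defect_id, image_link in final_image_info_list:
    i = bisect_left(key_pairs, (category_id, defect_id))
    if i < len(key_pairs) and key_pairs[i] == (category_id, defect_id):
        row_label = frame_main.index[keys[i][2]]          # positional -> index label
        frame_main.loc[row_label, "Image_Link"] = image_link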
Idea 2
Since Image-link is a string sequence, I'm guessing pandas has a harder time searching for specific values within it. You can preprocess by making a dictionary mapping each value to a list of indices within the Dataframe. In the extreme case, where each Image-link value has O(1) rows, this can reduce the time from Θ(mn) to Θ(n + m).
Edit
In the extreme case the OP mentions in the comment, all Image-link values are unique. In this case, it is possible to build a dictionary mapping their values to indices like so:
dict([(k, i) for (i, k) in enumerate(df['Image-link'].values)])
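Applied to the (CategoryID, DefectID) pairs you actually query on, a hedged sketch of Idea 2 could look like the following (again assuming each pair matches one row; otherwise store a list of positions per key):

key_to_pos = {
    (cat, dfct): pos
    for pos, (cat, dfct) in enumerate(zip(frame_main["CategoryID"], frame_main["DefectID"]))
}

for category_id, defect_id, image_link in final_image_info_list:
    pos = key_to_pos.get((category_id, defect_id))   # O(1) lookup per query
    if pos is not None:
        frame_main.loc[frame_main.index[pos], "Image_Link"] = image_link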
I am working with a very large dataframe (3.5 million x 150, which takes 25 gigs of memory when unpickled) and I need to find the maximum of one column over an id number and a date, and keep only the row with that maximum value. Each row is a recorded observation for one id at a certain date, and I also need the latest date.
This is animal test data where there are twenty additional columns seg1-seg20 for each id and date that are filled with test day information consecutively; for example, the first test's data fills seg1, the second test's data fills seg2, etc. The "value" field indicates how many segments have been filled, in other words how many tests have been done, so the row with the maximum "value" has the most test data. Ideally I only want these rows and not the previous rows. For example:
df = DataFrame({'id': [1000, 1000, 1001, 2000, 2000, 2000],
                "date": [20010101, 20010201, 20010115, 20010203, 20010223, 20010220],
                "value": [3, 1, 4, 2, 6, 6],
                "seg1": [22, 76, 23, 45, 12, 53],
                "seg2": [23, "", 34, 52, 24, 45],
                "seg3": [90, "", 32, "", 34, 54],
                "seg4": ["", "", 32, "", 43, 12],
                "seg5": ["", "", "", "", 43, 21],
                "seg6": ["", "", "", "", 43, 24]})
df
date id seg1 seg2 seg3 seg4 seg5 seg6 value
0 20010101 1000 22 23 90 3
1 20010201 1000 76 1
2 20010115 1001 23 34 32 32 4
3 20010203 2000 45 52 2
4 20010223 2000 12 24 34 43 43 41 6
5 20010220 2000 12 24 34 43 44 35 6
And eventually it should be:
date id seg1 seg2 seg3 seg4 seg5 seg6 value
0 20010101 1000 22 23 90 3
2 20010115 1001 23 34 32 32 4
4 20010223 2000 12 24 34 43 43 41 6
I first tried to use .groupby('id').max but couldn't find a way to use it to drop rows. The resulting dataframe MUST contain the ORIGINAL ROWS and not just the maximum value of each column for each id. My current solution is:
for i in df.id.unique():
    df = df.drop(df.loc[df.id == i].sort(['value', 'date']).index[:-1])
But this takes around 10 seconds to run each time through, I assume because it's trying to call up the entire dataframe each time. There are 760,000 unique ids, each 17 digits long, so it will take way too long to be feasible at this rate.
Is there another method that would be more efficient? Currently it reads every column in as an "object", but converting the relevant columns to the lowest possible integer size doesn't seem to help either.
I tried with groupby('id').max() and it works, and it also drops the rows. Did you remember to reassign the df variable? This operation (like almost all pandas operations) is not in-place.
If you do:
df.groupby('id', sort = False).max()
You will get:
date value
id
1000 20010201 3
1001 20010115 4
2000 20010223 6
And if you don't want id as the index, you do:
df.groupby('id', sort = False, as_index = False).max()
And you will get:
id date value
0 1000 20010201 3
1 1001 20010115 4
2 2000 20010223 6
I don't know if that's going to be much faster, though.
Update
This way the index will not be reset:
df.iloc[df.groupby('id').apply(lambda x: x['value'].idxmax())]
And you will get:
date id seg1 seg2 seg3 seg4 seg5 seg6 value
0 20010101 1000 22 23 90 3
2 20010115 1001 23 34 32 32 4
4 20010223 2000 12 24 34 43 43 43 6
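A hedged alternative (my own suggestion, not part of the approach above) that also breaks ties in value by keeping the latest date: sort so the desired row ends up last within each id, then drop duplicates.

out = (df.sort_values(['id', 'value', 'date'])
         .drop_duplicates('id', keep='last')
         .sort_index())
print(out)   # original rows 0, 2 and 4, with their original index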