Split string and append parts in a running list [Python]

I have a very long list that contains the same pattern. Here is an original example:
04:50 10 244.685 0 0
05:00 10 244.680 0 0
HBCHa 9990 Seite 762
16.02.16
Hafenpegel
Pegel müM Var 0 Pegelstand
Datum Zeit Intervall müM Q Art
Sonntag, 31. Dezember 2000 05:10 10 244.679 0 0
05:20 10 244.688 0 0
05:30 10 244.694 0 0
05:40 10 244.688 0 0
As you can see, there is one line where the measurement data sits inside the string that starts with "Sonntag".
My target is:
04:50 10 244.685 0 0
05:00 10 244.680 0 0
HBCHa 9990 Seite 762
16.02.16
Hafenpegel
Pegel müM Var 0 Pegelstand
Datum Zeit Intervall müM Q Art
Sonntag, 31. Dezember 2000
05:10 10 244.679 0 0 !!
05:20 10 244.688 0 0
05:30 10 244.694 0 0
05:40 10 244.688 0 0
I managed to read the txt file into a list, here called "data_list_splitted", to catch this one line across the whole txt file, split it, and extract the part with the measurements:
for i in data_list_splitted:
    if len(i) >= 40:
        ii = i.split(";")
        txt_line = "%s;%s;%s;%s;%s" % (ii[4], ii[5], ii[6], ii[7], ii[8])
But I can't manage to break this line apart and add the measurement values to the running list!
I think this shouldn't be that difficult?
Any ideas?
Thank you very much!

You can create another list and insert the values into it:
new_data_list_splitted = []
for i in data_list_splitted:
    if len(i) >= 40:
        ii = i.split(";")
        # first the date part (fields 0-3) ...
        txt_line = "%s;%s;%s;%s" % (ii[0], ii[1], ii[2], ii[3])
        new_data_list_splitted.append(txt_line)
        # ... then the measurement part (fields 4-8) as its own row
        txt_line = "%s;%s;%s;%s;%s" % (ii[4], ii[5], ii[6], ii[7], ii[8])
        new_data_list_splitted.append(txt_line)
    else:
        new_data_list_splitted.append(i)
print(new_data_list_splitted)  # this will have a new row for the measurement values
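If the real file is not reliably semicolon-separated, another option is to split the combined line at the first time stamp. This is only a sketch: it assumes every measurement row begins with an HH:MM time, which is an assumption beyond the original code.

import re

# A sketch: split a combined line at the first "HH:MM " time stamp,
# assuming every measurement row begins with a time like "05:10".
pattern = re.compile(r'\d{2}:\d{2}\s')

new_list = []
for line in data_list_splitted:
    m = pattern.search(line)
    if m and m.start() > 0:
        # there is a prefix (e.g. "Sonntag, 31. Dezember 2000") before the measurements
        new_list.append(line[:m.start()].rstrip())
        new_list.append(line[m.start():])
    else:
        new_list.append(line)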


Character specific conditional check in a string

I have to read and analyse some log files using Python; they usually contain strings in the following desired format:
date Don Dez 10 21:41:41.747 2020
base hex timestamps absolute
no internal events logged
// version 13.0.0
//28974.328957 previous log file: 21-41-41_Voltage.asc
// Measurement UUID: 9e0029d6-43a0-49e3-8708-3ec70363124c
28976.463987 LoggingString := "Log,5:45 AM, Friday, December 11, 2020,05:45:20.6,65.48,11.99,0.009843,12,0.01078,11.99,0.01114,11.99,0.01096,12,0.009984,4.595,0,1.035,0,0.1745,0,2,OM_2_1,0"
28978.600018 LoggingString := "Log,5:45 AM, Friday, December 11, 2020,05:45:22.7,65.47,11.99,0.009896,12,0.01079,11.99,0.01117,11.99,0.01097,12,0.009965,4.628,0,1.044,0,0.1698,0,2,OM_2_1,0"
However, sometimes files are created with undesired formats like the ones below:
date Die Jul 13 08:40:22.878 2021
base hex timestamps absolute
no internal events logged
// version 13.0.0
//1035.595166 previous log file: 08-40-22_Voltage.asc
// Measurement UUID: 2baf3f3f-300a-4f0a-bcbf-0ba5679d8be2
"1203.997816 LoggingString := ""Log" 9:01 am Tuesday July 13 2021 09:01:58.3 24.53 13.38 0.8948 13.37 0.8801 13.37 0.89 13.37 0.9099 13.47 0.8851 4.551 0.00115 0.8165 0 0.2207 0 5 OM_3_2 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 "0"""
"1206.086064 LoggingString := ""Log" 9:02 am Tuesday July 13 2021 09:02:00.4 24.53 13.37 0.8945 13.37 0.8801 13.37 0.8902 13.37 0.9086 13.46 0.8849 5.142 0.001185 1.033 0 0.1897 0 5 OM_3_2 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 "0"""
OR
date Mit Jun 16 10:11:43.493 2021
base hex timestamps absolute
no internal events logged
// version 13.0.0
// Measurement UUID: fe4a6a97-d907-4662-89f9-bd246aa54a33
10025.661597 LoggingString := """""""Log""" 12:59 PM Wednesday June 16 2021 12:59:01.1 66.14 0.00423 0 0.001206 0 0.001339 0 0.001229 0 0.001122 0 0.05017 0 0.01325 0 0.0643 0 0 OM_2_1_transition 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 """0"""""""
10030.592652 LoggingString := """""""Log""" 12:59 PM Wednesday June 16 2021 12:59:06.1 66.14 11.88 0.1447 11.88 0.1444 11.88 0.1442 11.87 0.005552 11.9 0.00404 2.55 0 0.4712 0 0.09924 0 0 OM_2_1_transition 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 """0"""""""
Since I am only concerned with the data below the "// Measurement UUID" line, I am using this code to extract data from strings that are in the desired format:
files = os.listdir(directory)
files = natsorted(files)
for file in files:
    base, ext = os.path.splitext(file)
    if file not in processed_files and ext == '.asc':
        print("File added:", file)
        file_path = os.path.join(directory, file)
        count = 0
        with open(file_path, 'r') as file_in:
            processed_files.append(file)
            Output_list = []  # each string from the file is read into this list
            Final = []        # the required data from each string is isolated & stored here
            for line in map(str.strip, file_in):
                if "LoggingString" in line:
                    # positions of the opening quote and of the closing quote at the end of the line
                    first_quote = line.index('"')
                    last_quote = line.index('"', first_quote + 1)
                    Output_list.append(
                        line[:first_quote].split(maxsplit=1)
                        + line[first_quote + 1:last_quote].split(","),
                    )
                    Final.append(Output_list[count][7:27])
The undesired format contains one or more whitespace characters between the fields, as seen above. I guess the log file generator sometimes produces a non-comma-separated file, or a comma-separated file with errors; I am not sure.
I tried to put this condition after it:
if "LoggingString" in line :
if ',' in line:
first_quote = line.index('"')
last_quote = line.index('"', first_quote + 1)
Output_list.append(line[:first_quote].split(maxsplit=1)
+ line[first_quote + 1: last_quote].split(","),)
Final.append(Output_list[count][7:27])
else:
raise Exception("Error in File")
However, this didn't serve the purpose: if a line in an undesired format happens to contain even one ',', the program considers it valid and processes it, which produces false results.
How do I ensure that only files containing strings in the desired format are processed, and that an error message is printed when others are encountered? What type of conditional check could be implemented here?
You can use pandas.read_csv with a regex separator:
import glob
import pandas as pd

l = []
for f in glob.glob("/tmp/Log*.txt"):
    df = (pd.read_csv(f, sep=r',|(?<=[\w"])\s+(?=[\w"])',
                      header=None, skiprows=6, engine="python").iloc[:, 2:28])
    df.insert(0, "filename", f.split("\\")[-1])
    l.append(df)
out = pd.concat(l)
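The separator regex matches either a comma or a run of whitespace between word characters or quotes, so both the comma-separated and the space-padded variants collapse into the same columns; skiprows=6 drops the six header lines above the first LoggingString entry. If you still want an explicit per-line check that rejects malformed files (as in the original attempt), a stricter regex over the whole line is one option; the exact pattern below is an assumption based on the samples shown:

import re

# A sketch: a desired line looks like
#   <float> LoggingString := "Log,<field>,<field>,..."
# with no stray quotes or space-separated fields inside the payload.
VALID = re.compile(r'^\d+\.\d+ LoggingString := "Log(?:,[^",]*)+"$')

def is_valid(line: str) -> bool:
    return bool(VALID.match(line.strip()))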

How to read the selected lines from a dataset with non-even intervals?

I have this text file whose second column is in increasing order, but at some points values repeat, e.g. 0, 12, 12, 36, ... I'm referring to the rows that separate the blocks, starting with 0 0, then 1 0, and so on. I just want to skip these while reading the data, so that the second column keeps its increasing value.
Can someone tell me a way to do that in Python?
0 0 1 1 1 0 0 0
0 3 0.999551 0.998204 0.995963 2.02497e-06 8.08878e-06 1.81582e-05
0 6 0.999226 0.996908 0.993056 3.50103e-06 1.39702e-05 3.13067e-05
0 9 0.998916 0.995669 0.990283 4.90435e-06 1.95504e-05 4.3739e-05
0 12 0.998613 0.994464 0.987587 6.27845e-06 2.50041e-05 5.58512e-05
0 15 0.998309 0.993255 0.984888 7.63421e-06 3.03781e-05 6.77611e-05
0 18 0.998008 0.992055 0.982214 8.97082e-06 3.56643e-05 7.9433e-05
0 21 0.99771 0.990872 0.979581 1.03001e-05 4.09117e-05 9.09826e-05
0 24 0.997413 0.989692 0.976958 1.16094e-05 4.60742e-05 0.000102324
0 27 0.997111 0.988494 0.974298 1.29506e-05 5.13517e-05 0.000113877
0 30 0.996811 0.987306 0.971666 1.42973e-05 5.66363e-05 0.000125395
0 33 0.996514 0.986129 0.969062 1.56102e-05 6.17854e-05 0.000136606
0 36 0.99622 0.984966 0.96649 1.6868e-05 6.67128e-05 0.000147314
1 0 1 1 1 0 0 0
1 12 0.998615 0.994472 0.987606 1.24824e-05 4.97091e-05 0.000111026
1 24 0.997408 0.989674 0.976917 2.32538e-05 9.22819e-05 0.000204924
1 36 0.996216 0.98495 0.966456 3.37665e-05 0.000133547 0.000294894
1 48 0.995023 0.98024 0.956083 4.41221e-05 0.000173927 0.000381972
1 60 0.993849 0.975622 0.945978 5.45843e-05 0.000214354 0.000467853
1 72 0.992678 0.971031 0.93599 6.49638e-05 0.000254364 0.000552466
1 84 0.991501 0.966432 0.926044 7.5403e-05 0.000294247 0.000635589
1 96 0.990323 0.961846 0.916176 8.55362e-05 0.000332815 0.000715435
1 108 0.989133 0.95723 0.90631 9.602e-05 0.000372371 0.000796123
1 120 0.987925 0.952552 0.89635 0.000106095 0.000410211 0.000872709
1 132 0.986728 0.947946 0.886629 0.000116829 0.000449985 0.000951404
1 144 0.985536 0.943378 0.87706 0.000127786 0.000490311 0.00103029
1 156 0.984333 0.938787 0.867512 0.000138898 0.000531114 0.00110972
1 168 0.983124 0.93419 0.858003 0.000149945 0.000571148 0.00118605
2 0 1 1 1 0 0 0
2 60 0.993889 0.975779 0.946334 0.000122674 0.000481801 0.0010518
2 120 0.98802 0.95292 0.897129 0.000235474 0.000910013 0.0019347
2 180 0.981998 0.929939 0.849324 0.000360693 0.00136728 0.00281767
2 240 0.976087 0.907868 0.805034 0.00048759 0.00180865 0.0036021
2 300 0.970186 0.886203 0.762767 0.000606964 0.00221121 0.0042844
2 360 0.964519 0.865822 0.724262 0.000723555 0.00257783 0.0048463
2 420 0.959195 0.846993 0.689658 0.000830297 0.00290486 0.00533017
2 480 0.953931 0.828808 0.657473 0.000940967 0.00322907 0.00579317
2 540 0.948992 0.812283 0.629672 0.00105503 0.0035387 0.00617566
2 600 0.94387 0.795353 0.601452 0.00116622 0.00381699 0.00650445
2 660 0.938843 0.778862 0.57426 0.00126677 0.00406694 0.00680719
2 720 0.933909 0.762839 0.548423 0.0013606 0.0043114 0.00712883
2 780 0.929153 0.7477 0.525167 0.00145272 0.00455818 0.0074014
2 840 0.924413 0.732931 0.503387 0.00154657 0.00480149 0.00765192
2 900 0.919724 0.718536 0.482191 0.00163803 0.0050077 0.00783869
You can load the file with np.loadtxt and delete the second column with np.delete, using axis=1:
import numpy as np

arr = np.loadtxt('test.txt')
arr = np.delete(arr, 1, axis=1)  # drop the second column
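If the goal is instead to skip the separator rows themselves (the ones like 0 0 1 1 1 0 0 0 where the second column resets), a boolean mask is one option; this is a sketch, assuming the separator rows are exactly those with 0 in the second column, as in the posted data:

import numpy as np

arr = np.loadtxt('test.txt')
# keep only rows whose second column is non-zero,
# i.e. skip the "0 0 ...", "1 0 ..." separator rows
arr = arr[arr[:, 1] != 0]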

How to feed multiple files to pandas to filter data and concatenate all the results

I have written some code to perform data cleaning and get the final columns and values from a tab-separated file.
import matplotlib.image as image
import numpy as np
import pandas as pd
import tkinter as tk
import matplotlib.ticker as ticker
from tkinter import filedialog
import matplotlib.pyplot as plt

root = tk.Tk()
root.withdraw()
root.call('wm', 'attributes', '.', '-topmost', True)
files1 = filedialog.askopenfilename(multiple=True)
files = root.tk.splitlist(files1)
List = list(files)
%gui tk
for i, file in enumerate(List, 1):
    d = pd.read_csv(file, sep=None, engine='python')
    h = d.drop(d.index[19:])
    transpose = h.T
    header = transpose.iloc[0]
    df = transpose[1:]
    df.columns = header
    df.columns = df.columns.str.strip()
    all_columns = list(df)
    df[all_columns] = df[all_columns].astype(str)
    k = df.drop(columns=['Op:', 'Comment:', 'Mod Type:', 'PN', 'Irradiance:',
                         'Irr Correct:', 'Lamp Voltage:', 'Corrected To:',
                         'MCCC:', 'Rseries:', 'Rshunt:'], axis=1)
    k.head()
I want to run this code on multiple files, do the same processing, and concatenate all the results into one data frame.
For example, if I select 20 files, I want a new data frame with one header line and all 20 results below it, in increasing order of the values in the column ['Module Temp:'].
It would be great if someone could provide a solution to this problem.
Please find the link to the sample data: https://drive.google.com/drive/folders/1sL2-CwCGeGm0-fvcpzMVzgFnYzN3wzVb?usp=sharing
The following code shows how to parse the files and extract the data. It doesn't show the tkinter GUI component. files will represent your selected files.
Assumptions:
The first 92 rows of the files are always the measurement parameters
Rows from 93 are the measurements.
The 'Module Temp' for each file is different
The lists will be sorted based on the sort order of mod_temp, so the data will be in order in the DataFrame.
The list sorting uses the accepted answer to Sorting list based on values from another list?
import pandas as pd
from pathlib import Path
# set path to files
path_ = Path('e:/PythonProjects/stack_overflow/data/so_data/2020-11-16')
# select the correct files
files = path_.glob('*.ivc')
# create lists for metrics
measurement_params = list()
mod_temp = list()
measurements = list()
# iterate through the files
for f in files:
    # get the first 92 rows with the measurement parameters
    mp = pd.read_csv(f, sep='\t', nrows=91, index_col=0)
    # remove the whitespace and : from the end of the index names
    mp.index = mp.index.str.replace(':', '').str.strip().str.replace(r'\s+', '_', regex=True)
    # get the column header
    col = mp.columns[0]
    # get the module temp
    mt = mp.loc['Module_Temp', col]
    # add Module_Temp to mod_temp
    mod_temp.append(float(mt))
    # get the measurements
    m = pd.read_csv(f, sep='\t', skiprows=92, nrows=3512)
    # remove the whitespace and : from the end of the column names
    m.columns = m.columns.str.replace(':', '').str.strip()
    # add Module_Temp column
    m['mod_temp'] = mt
    # store the measurement parameters
    measurement_params.append(mp.T)
    # store the measurements
    measurements.append(m)
# sort lists based on mod_temp sort order
measurement_params = [x for _, x in sorted(zip(mod_temp, measurement_params))]
measurements = [x for _, x in sorted(zip(mod_temp, measurements))]
# create a dataframe for the measurement parameters
df_mp = pd.concat(measurement_params)
# create a dataframe for the measurements
df_m = pd.concat(measurements).reset_index(drop=True)
df_mp
Title: Comment Op ID Mod_Type PN Date Time Irradiance IrrCorr Irr_Correct Lamp_Voltage Module_Temp Corrected_To MCCC Voc Isc Rseries Rshunt Pmax Vpm Ipm Fill_Factor Active_Eff Aperture_Eff Segment_Area Segs_in_Ser Segs_in_Par Panel_Area Vload Ivld Pvld Frequency SweepDelay SweepLength SweepSlope SweepDir MCCC2 MCCC3 MCCC4 LampI IntV IntV2 IntV3 IntV4 LoadV PulseWidth1 PulseWidth2 PulseWidth3 PulseWidth4 TRef1 TRef2 TRef3 TRef4 MCMode Irradiance2 IrrCorr2 Voc2 Isc2 Pmax2 Vpm2 Ipm2 Fill_Factor2 Active_Eff2 ApertureEff2 LoadV2 PulseWidth12 PulseWidth22 Irradiance3 IrrCorr3 Voc3 Isc3 Pmax3 Vpm3 Ipm3 Fill_Factor3 Active_Eff3 ApertureEff3 LoadV3 PulseWidth13 PulseWidth23 RefCellID RefCellTemp RefCellIrrMM RefCelIscRaw RefCellIsc VTempCoeff ITempCoeff PTempCoeff MismatchCorr Serial_No Soft_Ver
Nease 345W N345M72 STC Admin MCIND2021-058 ModuleType1 NaN 10-09-2020 19:12:52 100.007 100 Ref Cell 2400 25.2787 25 1.3669 46.4379 9.13215 0.43411 294.467 331.924 38.3403 8.65732 0.78269 1.89434 1.7106 243.36 72 1 19404 0 0 0 218000 10 100 0.025 0 1 1.155 1.155 20.4736 6.87023 6.8645 6 6 6.76 107.683 109.977 0 0 27.2224 0 0 0 False -1.#INF 70 0 0 0 0 0 0 0 0 5 107.683 109.977 -1.#INF 40 0 0 0 0 0 0 0 0 5 107.683 109.977 WPVS mono C-Si Ref Cell 25.9834 1001.86 0.15142 0.15135 -0.31 0.05 -0.4 0.9985 S91-00052 5.5.1
Solarium SGE24P330 STC Admin MCIND_2021_0074 ModuleType1 NaN 17-09-2020 15:06:12 99.3671 100 Ref Cell 2400 25.3380 25 1.3669 45.2903 8.87987 0.48667 216.763 311.031 36.9665 8.41388 0.77338 1.77510 1.60292 243.36 72 1 19404 0 0 0 218000 10 100 0.025 0 1 1.155 1.155 20.405 6.82362 6.8212 6 6 6.6 107.660 109.977 0 0 25.9418 0 0 0 False -1.#INF 70 0 0 0 0 0 0 0 0 4.943 107.660 109.977 -1.#INF 40 0 0 0 0 0 0 0 0 4.943 107.660 109.977 WPVS mono C-Si Ref Cell 25.3315 998.370 0.15085 0.15082 -0.31 0.05 -0.4 0.9985 S91-00052 5.5.1
Nease 345W N345M72 STC Admin MCIND2021-058 ModuleType1 NaN 10-09-2020 19:11:32 100.010 100 Ref Cell 2400 25.3557 25 1.3669 46.4381 9.11368 0.41608 299.758 331.418 38.3876 8.63345 0.78308 1.89144 1.70798 243.36 72 1 19404 0 0 0 218000 10 100 0.025 0 1 1.155 1.155 20.3820 6.87018 6.8645 6 6 6.76 107.683 109.977 0 0 27.2535 0 0 0 False -1.#INF 70 0 0 0 0 0 0 0 0 5 107.683 109.977 -1.#INF 40 0 0 0 0 0 0 0 0 5 107.683 109.977 WPVS mono C-Si Ref Cell 25.9614 1003.80 0.15171 0.15164 -0.31 0.05 -0.4 0.9985 S91-00052 5.5.1
Nease 345W N345M72 STC Admin MCIND2021-058 ModuleType1 NaN 10-09-2020 19:14:09 99.9925 100 Ref Cell 2400 25.4279 25 1.3669 46.4445 9.14115 0.43428 291.524 332.156 38.2767 8.67776 0.78236 1.89566 1.71179 243.36 72 1 19404 0 0 0 218000 10 100 0.025 0 1 1.155 1.155 20.5044 6.87042 6.8645 6 6 6.76 107.660 109.977 0 0 27.1989 0 0 0 False -1.#INF 70 0 0 0 0 0 0 0 0 5 107.660 109.977 -1.#INF 40 0 0 0 0 0 0 0 0 5 107.660 109.977 WPVS mono C-Si Ref Cell 26.0274 1000.93 0.15128 0.15121 -0.31 0.05 -0.4 0.9985 S91-00052 5.5.1
df_m.head()
Voltage Current mod_temp
0 -1.193405 9.202885 25.2787
1 -1.196560 9.202489 25.2787
2 -1.193403 9.201693 25.2787
3 -1.196558 9.201298 25.2787
4 -1.199711 9.200106 25.2787
df_m.tail()
Voltage Current mod_temp
14043 46.30869 0.315269 25.4279
14044 46.31411 0.302567 25.4279
14045 46.31949 0.289468 25.4279
14046 46.32181 0.277163 25.4279
14047 46.33039 0.265255 25.4279
Plot
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(20, 8))
sns.scatterplot(x='Current', y='Voltage', data=df_m, hue='mod_temp', s=10)
plt.show()
Note
After doing this, I was having trouble plotting the data because the columns were not float type. However, an error occurred when trying to set the type. Looking back at the data, after row 92, there are multiple headers throughout the two columns.
Row 93: Voltage: Current:
Row 3631: Ref Cell: Lamp I:
Row 7169: Voltage2: Current2:
Row 11971: Ref Cell2: Lamp I2:
Row 16773: Voltage3: Current3:
Row 21575: Ref Cell3: Lamp I3:
Row 26377: Raw Voltage: Raw Current :
Row 29915: WPVS Voltage: WPVS Current:
I went back and used the nrows parameter when creating m, so only the first set of headers and associated measurements are extracted from the file.
I recommend writing a script using the csv module to read each file and create a new file beginning at each blank row; that will give every file a consistent set of measurements. A sketch of that splitting step follows below.
This should be a new question, if needed.
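A minimal sketch of that preprocessing, assuming tab-separated input and that a blank line separates the measurement blocks (the paths and naming scheme are placeholders):

import csv
from pathlib import Path

src = Path('data/raw_file.txt')   # placeholder input path
out_dir = Path('data/split')      # placeholder output directory
out_dir.mkdir(exist_ok=True)

block, n = [], 0

def flush(block, n):
    # write the current block to its own numbered file
    with (out_dir / f'{src.stem}_{n}.txt').open('w', newline='') as g:
        csv.writer(g, delimiter='\t').writerows(block)

with src.open(newline='') as f:
    for row in csv.reader(f, delimiter='\t'):
        if not any(row):          # blank row -> end of the current block
            if block:
                flush(block, n)
                block, n = [], n + 1
        else:
            block.append(row)
if block:                         # write the trailing block
    flush(block, n)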
There are various ways to do it. You can append one dataframe to another (basically stack one on top of the other), and you can do it in the loop. Here is an example; I use fake dfs, but you will use your own:
import pandas as pd
import numpy as np
combined = None
for _ in range(5):
    # stub df creation -- you will use your real code here
    df = pd.DataFrame(columns=['Module Temp', 'A', 'B'], data=np.random.random((5, 3)))
    if combined is None:
        # initialize with the first one
        combined = df.copy()
    else:
        # add the next one
        combined = combined.append(df, sort=False, ignore_index=True)
combined.sort_values('Module Temp', inplace=True)
Here combined will have all the dfs, sorted by 'Module Temp'
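Note that DataFrame.append was removed in pandas 2.0. The same pattern works by collecting the frames in a list and concatenating once at the end; a sketch with the same stub dfs:

import pandas as pd
import numpy as np

frames = []
for _ in range(5):
    # stub df creation -- replace with your real per-file processing
    df = pd.DataFrame(columns=['Module Temp', 'A', 'B'], data=np.random.random((5, 3)))
    frames.append(df)
# one concat at the end instead of appending inside the loop
combined = pd.concat(frames, sort=False, ignore_index=True)
combined = combined.sort_values('Module Temp')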

Compare some columns from some tables using python

I need to compare two values MC and JT from 2 tables:
EID MolIdx TEStart TEEnd TE TZone TBulkBE TBulkAE MC JT zavg vabs vzavg xyd.x xyd.y xydist nnbw vabsprev midhb
0 370 36700 36800 110 20 36150 37090 0 0 -8.25705 0.219113 -0.000800014 20.8926 41.4347 5.75852 0 4.13067 0
1 423 17950 18150 210 180 17400 18430 1 0 -4.26426 0.586578 -0.053 77.22 85.2104 22.0534 0 3.551 0
2 468 41790 42020 240 50 41360 42380 0 0 7.82681 0.181248 -0.00269566 90.0646 92.7698 5.0841 0 4.19304 0
and
EID MolIdx TEStart TEEnd TE TZone TBulkBE TBulkAE MC JT zavg vabs vzavg xyd.x xyd.y xydist nnbw vabsprev midhb
0 370 36700 36800 110 20 36150 37090 0 0 -0.846655 0.0218695 2.59898e-05 2.0724 4.1259 0.583259 10 0.412513 0
1 423 17950 18150 210 180 17400 18780 1 0 -0.453311 0.058732 -0.00526783 7.7403 8.52544 2.19627 0 0.354126 0
2 468 41790 42020 240 70 41360 42380 0 0 0.743716 0.0181613 -0.000256186 9.08777 9.21395 0.502506 0 0.419265 0
I need to do it using the csv module. I know how to do it using pandas and xlrd, but not with csv.
Desired output:
Number_of_strings MC JT
and print the rows where the values differ.
import csv

old = csv.reader(open('old.csv', newline=''), delimiter=',')
row1 = next(old)
new = csv.reader(open('new.csv', newline=''), delimiter=',')
row2 = next(new)
if (row1[8] == row2[8]) and (row1[9] == row2[9]):
    pass
else:
    print(row1[0] + ':' + row1[8] + '!=' + row2[8])
You can try something like the following:
old = list(csv.reader(open('old.csv', newline=''), delimiter=','))
new = list(csv.reader(open('new.csv', newline=''), delimiter=','))
# transpose, so each entry is a whole column
old = list(zip(*old))
new = list(zip(*new))
print(['%s-%s-%s' % (a, b, c) for a, b, c in zip(old[0], new[8], old[8]) if b != c])
First, we get a list of lists. zip(*x) will transpose a list of lists. The rest should be easy to decipher ...
You can actually put whatever you want within the string ...
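If you want the output in the "Number_of_strings MC JT" shape from the question, a row-by-row walk over both readers is a straightforward variant (a sketch, assuming the first row of each file is a header):

import csv

with open('old.csv', newline='') as f1, open('new.csv', newline='') as f2:
    old = csv.reader(f1, delimiter=',')
    new = csv.reader(f2, delimiter=',')
    next(old); next(new)          # skip the header rows
    print('Number_of_strings MC JT')
    for row1, row2 in zip(old, new):
        # columns 8 and 9 hold MC and JT
        if row1[8] != row2[8] or row1[9] != row2[9]:
            print(row1[0], row1[8] + '!=' + row2[8], row1[9] + '!=' + row2[9])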

After groupby and sum, how to get the max value rows in `pandas.DataFrame`?

Here is the df (I updated it with real data):
>TIMESTAMP OLTPSOURCE RNR RQDRECORD
>20150425232836 0PU_IS_PS_44 REQU_51NHAJUV06IMMP16BVE572JM2 17020
>20150128165726 ZFI_DS41 REQU_50P1AABLYXE86KYE3O6EY390M 6925
>20150701144253 ZZZJB_TEXT REQU_52DV5FB812JCDXDVIV9P35DGM 2
>20150107201358 0EQUIPMENT_ATTR REQU_50EVHXSDOITYUQLP4L8UXOBT6 14205
>20150623215202 0CO_OM_CCA_1 REQU_528XSXYWTK6FSJXDQY2ROQQ4Q 0
>20150715144139 0HRPOSITION_TEXT REQU_52I9KQ1LN4ZWTNIP0N1R68NDY 25381
>20150625175157 0HR_PA_0 REQU_528ZS1RFN0N3Y3AEB48UDCUKQ 100020
>20150309153828 0HR_PA_0 REQU_51385K5F3AGGFVCGHU997QF9M 0
>20150626185531 0FI_AA_001 REQU_52BO3RJCOG4JGHEIIZMJP9V4A 0
>20150307222336 0FUNCT_LOC_ATTR REQU_513JJ6I6ER5ZVW5CAJMVSKAJQ 13889
>20150630163419 0WBS_ELEMT_ATTR REQU_52CUPVUFCY2DDOG6SPQ1XOYQ2 0
>20150424162226 6DB_V_DGP_EXPORTDATA REQU_51N1F5ZC8G3LW68E4TFXRGH9I 0
>20150617143720 ZRZMS_TEXT REQU_5268R1YE6G1U7HUK971LX1FPM 6
>20150405162213 0HR_PA_0 REQU_51FFR7T4YQ2F766PFY0W9WUDM 0
>20150202165933 ZFI_DS41 REQU_50QPTCF0VPGLBYM9MGFXMWHGM 6925
>20150102162140 0HR_PA_0 REQU_50CNUT7I9OXH2WSNLC4WTUZ7U 0
>20150417184916 0FI_AA_004 REQU_51KFWWT6PPTI5X44D3MWD7CYU 0
>20150416220451 0FUNCT_LOC_ATTR REQU_51JP3BDCD6TUOBL2GK9ZE35UU 13889
>20150205150633 ZHR_DS09 REQU_50RFRYRADMA9QXB1PW4PRF5XM 6667
>20150419230724 0PU_IS_PS_44 REQU_51LC5XX6VWEERAVHEFJ9K5A6I 22528
and the relationships between the columns are:
OLTPSOURCE -- RNR: 1:n
RNR -- RQDRECORD: 1:n
and my requirement is:
sum the RQDRECORD by RNR;
get the max sum result for every OLTPSOURCE;
finally, draw a graph showing the largest summed result of each OLTPSOURCE over time.
Thanks everyone; to explain my problem further:
if OLTPSOURCE:RNR:RQDRECORD = 1:1:1, just sum RQDRECORD and return OLTPSOURCE and the sum result;
if OLTPSOURCE:RNR:RQDRECORD = 1:1:N, just sum RQDRECORD and return OLTPSOURCE and the sum result;
if OLTPSOURCE:RNR:RQDRECORD = 1:N:(N or 1), sum RQDRECORD by RNR group first, then find the max result per OLTPSOURCE and return every OLTPSOURCE with its max RQDRECORD.
So for the above sample data, I eventually want the result as follows:
>TIMESTAMP OLTPSOURCE RNR RQDRECORD
>20150623215202 0CO_OM_CCA_1 REQU_528XSXYWTK6FSJXDQY2ROQQ4Q 0
>20150107201358 0EQUIPMENT_ATTR REQU_50EVHXSDOITYUQLP4L8UXOBT6 14205
>20150626185531 0FI_AA_001 REQU_52BO3RJCOG4JGHEIIZMJP9V4A 0
>20150417184916 0FI_AA_004 REQU_51KFWWT6PPTI5X44D3MWD7CYU 0
>20150416220451 0FUNCT_LOC_ATTR REQU_51JP3BDCD6TUOBL2GK9ZE35UU 13889
>20150625175157 0HR_PA_0 REQU_528ZS1RFN0N3Y3AEB48UDCUKQ 100020
>20150715144139 0HRPOSITION_TEXT REQU_52I9KQ1LN4ZWTNIP0N1R68NDY 25381
>20150419230724 0PU_IS_PS_44 REQU_51LC5XX6VWEERAVHEFJ9K5A6I 22528
>20150630163419 0WBS_ELEMT_ATTR REQU_52CUPVUFCY2DDOG6SPQ1XOYQ2 0
>20150424162226 6DB_V_DGP_EXPORTDATA REQU_51N1F5ZC8G3LW68E4TFXRGH9I 0
>20150202165933 ZFI_DS41 REQU_50QPTCF0VPGLBYM9MGFXMWHGM 6925
>20150205150633 ZHR_DS09 REQU_50RFRYRADMA9QXB1PW4PRF5XM 6667
>20150617143720 ZRZMS_TEXT REQU_5268R1YE6G1U7HUK971LX1FPM 6
>20150701144253 ZZZJB_TEXT REQU_52DV5FB812JCDXDVIV9P35DGM 2
Referring to EdChum's approach, I made some adjustments and the results were as follows. Because the amount of data is too big, I set 'RQDRECORD > 100000' as a filter; in fact I would like to sort and then take the top 100, but without success (result: http://i.imgur.com/FgfZaDY.jpg).
You can take the groupby result and call max on it, passing param level=0 (or level='clsa' if you prefer); this returns the max count for that level. However, this loses the 'clsb' column, so you can merge it back onto your grouped result after calling reset_index on the grouped object, and reorder the resulting df columns using fancy indexing:
In [149]:
gp = df.groupby(['clsa','clsb']).sum()
result = gp.max(level=0).reset_index().merge(gp.reset_index())
result = result.loc[:, ['clsa', 'clsb', 'count']]
result
Out[149]:
clsa clsb count
0 a a1 9
1 b b2 8
2 c c2 10
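Applied to the posted data, the same idea looks like this: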
df['TIMESTAMP'] = pd.to_datetime(df['TIMESTAMP'], format='%Y%m%d%H%M%S')
df_gb = df.groupby(['OLTPSOURCE', 'RNR'], as_index=False).aggregate('sum')
final = pd.merge(df[['TIMESTAMP', 'OLTPSOURCE', 'RNR']],
                 df_gb.groupby(['OLTPSOURCE'], as_index=False).first(),
                 on=['OLTPSOURCE', 'RNR'], how='right').sort_values('OLTPSOURCE')
final.plot(kind='bar')
plt.show()
print(final)
TIMESTAMP OLTPSOURCE RNR \
3 2015-06-23 21:52:02 0CO_OM_CCA_1 REQU_528XSXYWTK6FSJXDQY2ROQQ4Q
2 2015-01-07 20:13:58 0EQUIPMENT_ATTR REQU_50EVHXSDOITYUQLP4L8UXOBT6
5 2015-06-26 18:55:31 0FI_AA_001 REQU_52BO3RJCOG4JGHEIIZMJP9V4A
11 2015-04-17 18:49:16 0FI_AA_004 REQU_51KFWWT6PPTI5X44D3MWD7CYU
6 2015-03-07 22:23:36 0FUNCT_LOC_ATTR REQU_513JJ6I6ER5ZVW5CAJMVSKAJQ
4 2015-07-15 14:41:39 0HRPOSITION_TEXT REQU_52I9KQ1LN4ZWTNIP0N1R68NDY
10 2015-01-02 16:21:40 0HR_PA_0 REQU_50CNUT7I9OXH2WSNLC4WTUZ7U
13 2015-04-19 23:07:24 0PU_IS_PS_44 REQU_51LC5XX6VWEERAVHEFJ9K5A6I
7 2015-06-30 16:34:19 0WBS_ELEMT_ATTR REQU_52CUPVUFCY2DDOG6SPQ1XOYQ2
8 2015-04-24 16:22:26 6DB_V_DGP_EXPORTDATA REQU_51N1F5ZC8G3LW68E4TFXRGH9I
0 2015-01-28 16:57:26 ZFI_DS41 REQU_50P1AABLYXE86KYE3O6EY390M
12 2015-02-05 15:06:33 ZHR_DS09 REQU_50RFRYRADMA9QXB1PW4PRF5XM
9 2015-06-17 14:37:20 ZRZMS_TEXT REQU_5268R1YE6G1U7HUK971LX1FPM
1 2015-07-01 14:42:53 ZZZJB_TEXT REQU_52DV5FB812JCDXDVIV9P35DGM
RQDRECORD
3 0
2 14205
5 0
11 0
6 13889
4 25381
10 0
13 22528
7 0
8 0
0 6925
12 6667
9 6
1 2
