Compare some columns from two tables using Python

I need to compare the two columns MC and JT from 2 tables:
EID MolIdx TEStart TEEnd TE TZone TBulkBE TBulkAE MC JT zavg vabs vzavg xyd.x xyd.y xydist nnbw vabsprev midhb
0 370 36700 36800 110 20 36150 37090 0 0 -8.25705 0.219113 -0.000800014 20.8926 41.4347 5.75852 0 4.13067 0
1 423 17950 18150 210 180 17400 18430 1 0 -4.26426 0.586578 -0.053 77.22 85.2104 22.0534 0 3.551 0
2 468 41790 42020 240 50 41360 42380 0 0 7.82681 0.181248 -0.00269566 90.0646 92.7698 5.0841 0 4.19304 0
and
EID MolIdx TEStart TEEnd TE TZone TBulkBE TBulkAE MC JT zavg vabs vzavg xyd.x xyd.y xydist nnbw vabsprev midhb
0 370 36700 36800 110 20 36150 37090 0 0 -0.846655 0.0218695 2.59898e-05 2.0724 4.1259 0.583259 10 0.412513 0
1 423 17950 18150 210 180 17400 18780 1 0 -0.453311 0.058732 -0.00526783 7.7403 8.52544 2.19627 0 0.354126 0
2 468 41790 42020 240 70 41360 42380 0 0 0.743716 0.0181613 -0.000256186 9.08777 9.21395 0.502506 0 0.419265 0
I need to do it using the csv module. I know how to do it using pandas and xlrd, but I don't know how with csv.
Desired output:
Number_of_strings MC JT
and print the rows where the values differ. Here is my attempt:
import csv
old = csv.reader(open('old.csv', 'rb'), delimiter=',')
row1 = old.next()
new = csv.reader(open('new.csv', 'rb'), delimiter=',')
row2 = new.next()
if (row1[8] == row2[8]) and (row1[9] == row2[9]):
    continue
else:
    print row1[0] + ':' + row1[8] + '!=' + row2[8]

You can try something like the following:
import csv

old = list(csv.reader(open('old.csv'), delimiter=','))
new = list(csv.reader(open('new.csv'), delimiter=','))
old = list(zip(*old))
new = list(zip(*new))
print(['%s-%s-%s' % (a, b, c) for a, b, c in zip(old[0], new[8], old[8]) if b != c])
First, we read each file into a list of lists. zip(*x) transposes a list of lists, turning rows into columns. The rest should be easy to decipher ...
You can put whatever you want in the format string ...
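If you need both MC and JT compared row by row, in the output format the question asks for, a minimal sketch using only the csv module might look like this (column indices 8 and 9 are MC and JT, per the data above):

import csv

with open('old.csv', newline='') as f_old, open('new.csv', newline='') as f_new:
    old_rows = csv.reader(f_old, delimiter=',')
    new_rows = csv.reader(f_new, delimiter=',')
    print('Number_of_strings MC JT')
    for row1, row2 in zip(old_rows, new_rows):
        # columns 8 and 9 hold MC and JT; report rows where either differs
        if row1[8] != row2[8] or row1[9] != row2[9]:
            print('%s %s!=%s %s!=%s' % (row1[0], row1[8], row2[8], row1[9], row2[9]))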


How to feed multiple files to pandas to filter data and concatenate all the results

I have written code to perform some data cleaning and get the final columns and values from a tab-separated file.
import matplotlib.image as image
import numpy as np
import pandas as pd
import tkinter as tk
import matplotlib.ticker as ticker
from tkinter import filedialog
import matplotlib.pyplot as plt
root = tk.Tk()
root.withdraw()
root.call('wm', 'attributes', '.', '-topmost', True)
files1 = filedialog.askopenfilename(multiple=True)
files = root.tk.splitlist(files1)
List = list(files)
%gui tk  # IPython magic; requires running in IPython/Jupyter
for i, file in enumerate(List, 1):
    d = pd.read_csv(file, sep=None, engine='python')
    h = d.drop(d.index[19:])
    transpose = h.T
    header = transpose.iloc[0]
    df = transpose[1:]
    df.columns = header
    df.columns = df.columns.str.strip()
    all_columns = list(df)
    df[all_columns] = df[all_columns].astype(str)
    k = df.drop(columns=['Op:', 'Comment:', 'Mod Type:', 'PN', 'Irradiance:', 'Irr Correct:', 'Lamp Voltage:', 'Corrected To:', 'MCCC:', 'Rseries:', 'Rshunt:'], axis=1)
    k.head()
I want to run this code on multiple files, do the same cleaning, and concatenate all the results into one data frame.
For example, if I select 20 files, I want a new data frame with a single header row and all 20 results below it, sorted in increasing order of the values in the column ['Module Temp:'].
It would be great if someone could provide a solution to this problem.
Please find the link to sample data: https://drive.google.com/drive/folders/1sL2-CwCGeGm0-fvcpzMVzgFnYzN3wzVb?usp=sharing
The following code shows how to parse the files and extract the data. It doesn't show the tkinter GUI component. files will represent your selected files.
Assumptions:
The first 92 rows of the files are always the measurement parameters
Rows from 93 are the measurements.
The 'Module Temp' for each file is different
The lists will be sorted based on the sort order of mod_temp, so the data will be in order in the DataFrame.
The list sorting uses the accepted answer to Sorting list based on values from another list?
import pandas as pd
from pathlib import Path
# set path to files
path_ = Path('e:/PythonProjects/stack_overflow/data/so_data/2020-11-16')
# select the correct files
files = path_.glob('*.ivc')
# create lists for metrics
measurement_params = list()
mod_temp = list()
measurements = list()
# iterate through the files
for f in files:
    # get the first 92 rows with the measurement parameters
    mp = pd.read_csv(f, sep='\t', nrows=91, index_col=0)
    # remove the whitespace and : from the end of the index names
    mp.index = mp.index.str.replace(':', '').str.strip().str.replace('\\s+', '_')
    # get the column header
    col = mp.columns[0]
    # get the module temp
    mt = mp.loc['Module_Temp', col]
    # add Module_Temp to mod_temp
    mod_temp.append(float(mt))
    # get the measurements
    m = pd.read_csv(f, sep='\t', skiprows=92, nrows=3512)
    # remove the whitespace and : from the end of the column names
    m.columns = m.columns.str.replace(':', '').str.strip()
    # add Module_Temp column
    m['mod_temp'] = mt
    # store the measurement parameters
    measurement_params.append(mp.T)
    # store the measurements
    measurements.append(m)
# sort lists based on mod_temp sort order
measurement_params = [x for _, x in sorted(zip(mod_temp, measurement_params))]
measurements = [x for _, x in sorted(zip(mod_temp, measurements))]
# create a dataframe for the measurement parameters
df_mp = pd.concat(measurement_params)
# create a dataframe for the measurements
df_m = pd.concat(measurements).reset_index(drop=True)
df_mp
Title: Comment Op ID Mod_Type PN Date Time Irradiance IrrCorr Irr_Correct Lamp_Voltage Module_Temp Corrected_To MCCC Voc Isc Rseries Rshunt Pmax Vpm Ipm Fill_Factor Active_Eff Aperture_Eff Segment_Area Segs_in_Ser Segs_in_Par Panel_Area Vload Ivld Pvld Frequency SweepDelay SweepLength SweepSlope SweepDir MCCC2 MCCC3 MCCC4 LampI IntV IntV2 IntV3 IntV4 LoadV PulseWidth1 PulseWidth2 PulseWidth3 PulseWidth4 TRef1 TRef2 TRef3 TRef4 MCMode Irradiance2 IrrCorr2 Voc2 Isc2 Pmax2 Vpm2 Ipm2 Fill_Factor2 Active_Eff2 ApertureEff2 LoadV2 PulseWidth12 PulseWidth22 Irradiance3 IrrCorr3 Voc3 Isc3 Pmax3 Vpm3 Ipm3 Fill_Factor3 Active_Eff3 ApertureEff3 LoadV3 PulseWidth13 PulseWidth23 RefCellID RefCellTemp RefCellIrrMM RefCelIscRaw RefCellIsc VTempCoeff ITempCoeff PTempCoeff MismatchCorr Serial_No Soft_Ver
Nease 345W N345M72 STC Admin MCIND2021-058 ModuleType1 NaN 10-09-2020 19:12:52 100.007 100 Ref Cell 2400 25.2787 25 1.3669 46.4379 9.13215 0.43411 294.467 331.924 38.3403 8.65732 0.78269 1.89434 1.7106 243.36 72 1 19404 0 0 0 218000 10 100 0.025 0 1 1.155 1.155 20.4736 6.87023 6.8645 6 6 6.76 107.683 109.977 0 0 27.2224 0 0 0 False -1.#INF 70 0 0 0 0 0 0 0 0 5 107.683 109.977 -1.#INF 40 0 0 0 0 0 0 0 0 5 107.683 109.977 WPVS mono C-Si Ref Cell 25.9834 1001.86 0.15142 0.15135 -0.31 0.05 -0.4 0.9985 S91-00052 5.5.1
Solarium SGE24P330 STC Admin MCIND_2021_0074 ModuleType1 NaN 17-09-2020 15:06:12 99.3671 100 Ref Cell 2400 25.3380 25 1.3669 45.2903 8.87987 0.48667 216.763 311.031 36.9665 8.41388 0.77338 1.77510 1.60292 243.36 72 1 19404 0 0 0 218000 10 100 0.025 0 1 1.155 1.155 20.405 6.82362 6.8212 6 6 6.6 107.660 109.977 0 0 25.9418 0 0 0 False -1.#INF 70 0 0 0 0 0 0 0 0 4.943 107.660 109.977 -1.#INF 40 0 0 0 0 0 0 0 0 4.943 107.660 109.977 WPVS mono C-Si Ref Cell 25.3315 998.370 0.15085 0.15082 -0.31 0.05 -0.4 0.9985 S91-00052 5.5.1
Nease 345W N345M72 STC Admin MCIND2021-058 ModuleType1 NaN 10-09-2020 19:11:32 100.010 100 Ref Cell 2400 25.3557 25 1.3669 46.4381 9.11368 0.41608 299.758 331.418 38.3876 8.63345 0.78308 1.89144 1.70798 243.36 72 1 19404 0 0 0 218000 10 100 0.025 0 1 1.155 1.155 20.3820 6.87018 6.8645 6 6 6.76 107.683 109.977 0 0 27.2535 0 0 0 False -1.#INF 70 0 0 0 0 0 0 0 0 5 107.683 109.977 -1.#INF 40 0 0 0 0 0 0 0 0 5 107.683 109.977 WPVS mono C-Si Ref Cell 25.9614 1003.80 0.15171 0.15164 -0.31 0.05 -0.4 0.9985 S91-00052 5.5.1
Nease 345W N345M72 STC Admin MCIND2021-058 ModuleType1 NaN 10-09-2020 19:14:09 99.9925 100 Ref Cell 2400 25.4279 25 1.3669 46.4445 9.14115 0.43428 291.524 332.156 38.2767 8.67776 0.78236 1.89566 1.71179 243.36 72 1 19404 0 0 0 218000 10 100 0.025 0 1 1.155 1.155 20.5044 6.87042 6.8645 6 6 6.76 107.660 109.977 0 0 27.1989 0 0 0 False -1.#INF 70 0 0 0 0 0 0 0 0 5 107.660 109.977 -1.#INF 40 0 0 0 0 0 0 0 0 5 107.660 109.977 WPVS mono C-Si Ref Cell 26.0274 1000.93 0.15128 0.15121 -0.31 0.05 -0.4 0.9985 S91-00052 5.5.1
df_m.head()
Voltage Current mod_temp
0 -1.193405 9.202885 25.2787
1 -1.196560 9.202489 25.2787
2 -1.193403 9.201693 25.2787
3 -1.196558 9.201298 25.2787
4 -1.199711 9.200106 25.2787
df_m.tail()
Voltage Current mod_temp
14043 46.30869 0.315269 25.4279
14044 46.31411 0.302567 25.4279
14045 46.31949 0.289468 25.4279
14046 46.32181 0.277163 25.4279
14047 46.33039 0.265255 25.4279
Plot
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(20, 8))
sns.scatterplot(x='Current', y='Voltage', data=df_m, hue='mod_temp', s=10)
plt.show()
Note
After doing this, I was having trouble plotting the data because the columns were not of float type, and an error occurred when trying to set the type. Looking back at the data, after row 92 there are multiple header rows scattered through the two columns:
Row 93: Voltage: Current:
Row 3631: Ref Cell: Lamp I:
Row 7169: Voltage2: Current2:
Row 11971: Ref Cell2: Lamp I2:
Row 16773: Voltage3: Current3:
Row 21575: Ref Cell3: Lamp I3:
Row 26377: Raw Voltage: Raw Current :
Row 29915: WPVS Voltage: WPVS Current:
I went back and used the nrows parameter when creating m, so only the first set of headers and associated measurements are extracted from the file.
I recommend writing a script using the csv module to read each file and start a new file at each blank row; this will give each output file a consistent type of measurement (a sketch follows below).
This should be a new question, if needed.
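A minimal sketch of that split, assuming tab-separated content and blank lines between measurement blocks (the output naming scheme here is made up):

import csv
from pathlib import Path

def split_at_blank_rows(src):
    # write each blank-line-separated section of src to its own numbered file
    src = Path(src)
    part, n = [], 0
    with open(src, newline='') as f:
        for row in csv.reader(f, delimiter='\t'):
            if any(cell.strip() for cell in row):
                part.append(row)
            elif part:  # a blank row ends the current section
                n += 1
                with open(src.with_name(f'{src.stem}_part{n}.csv'), 'w', newline='') as out:
                    csv.writer(out, delimiter='\t').writerows(part)
                part = []
    if part:  # flush the final section
        n += 1
        with open(src.with_name(f'{src.stem}_part{n}.csv'), 'w', newline='') as out:
            csv.writer(out, delimiter='\t').writerows(part)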
There are various ways to do it. You can append one dataframe to another (basically stack one on top of the other), and you can do it in a loop. Here is an example; I use fake DataFrames, but you will use your own:
import pandas as pd
import numpy as np
combined = None
for _ in range(5):
    # stub df creation -- you will use your real code here
    df = pd.DataFrame(columns=['Module Temp', 'A', 'B'], data=np.random.random((5, 3)))
    if combined is None:
        # initialize with the first one
        combined = df.copy()
    else:
        # add the next one
        combined = combined.append(df, sort=False, ignore_index=True)
combined.sort_values('Module Temp', inplace=True)
Here combined will have all the dfs, sorted by 'Module Temp'.
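Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; collecting the frames in a list and concatenating once is the modern equivalent. A sketch:

import pandas as pd
import numpy as np

frames = []
for _ in range(5):
    # stub df creation -- replace with your real per-file processing
    df = pd.DataFrame(columns=['Module Temp', 'A', 'B'], data=np.random.random((5, 3)))
    frames.append(df)
# a single concat at the end is also faster than appending inside the loop
combined = pd.concat(frames, ignore_index=True).sort_values('Module Temp')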

Pandas: appending all row values based on a value present in list after randomized sampling

I have a dataframe df as follows:
ZZONE ZRET_QTY ZREQ_VAL ACTIVE
BLR 100 22.26 1
BLR 125 25.66 1
BOM 223 29.56 1
BOM 133 26.55 0
BOM 98 18.56 1
BLR 227 29.50 0
Objective:
To have a dataframe where, for each ZZONE, an equal number of 1s and 0s are present in ACTIVE. The sampling should be randomized. It should look like:
ZZONE ZRET_QTY ZREQ_VAL ACTIVE
BLR 100 22.26 1
BLR 227 29.50 0
BOM 223 29.56 1
BOM 133 26.55 0
My Approach so far:
zone_lst = df['ZZONE'].unique().tolist()
for z in zone_lst:
    df_z_1 = df[(df['ZZONE']==z) & (df['ACTIVE'] == 1)].sample(frac = 0.5)
    df_z_0 = df[(df['ZZONE']==z) & (df['ACTIVE'] == 0)].sample(frac = 0.5)
    df_z_f = pd.concat([df_z_1,df_z_0],axis=0) # <--appending data
This is generating a dataframe, but only BLR is present. In other words, the loop is only taking the first ZZONE when creating the dataframe df_z_f.
How do I run this loop for all elements in zone_lst?
Edit:
Please note that all ZZONE values must be present in the final df, and the sampling fraction (i.e. frac) can vary between ACTIVE 1 and 0.
Use DataFrame.sample with groupby in a custom function:
def func(x):
    # count the occurrences of each value and use the minimum as the n parameter in sample
    same = x['ACTIVE'].value_counts().min()
    a = x[x['ACTIVE'] == 0].sample(same).index
    b = x[x['ACTIVE'] == 1].sample(same).index
    return df.loc[a.union(b)]
Another idea:
def func(x):
    same = x['ACTIVE'].value_counts().min()
    a = x[x['ACTIVE'] == 0].sample(same)
    b = x[x['ACTIVE'] == 1].sample(same)
    return a.append(b)
df = df.groupby('ZZONE').apply(func).reset_index(drop=True)
print (df)
ZZONE ZRET_QTY ZREQ_VAL ACTIVE
0 BLR 125 25.66 1
1 BLR 227 29.50 0
2 BOM 223 29.56 1
3 BOM 133 26.55 0
EDIT:
def func(x):
    a = x[x['ACTIVE'] == 0].sample(frac = 0.4).index
    b = x[x['ACTIVE'] == 1].sample(frac = 0.6).index
    return df.loc[a.union(b)]
Or:
def func(x):
    a = x[x['ACTIVE'] == 0].sample(frac = 0.4)
    b = x[x['ACTIVE'] == 1].sample(frac = 0.6)
    return a.append(b)
df = df.groupby('ZZONE').apply(func).reset_index(drop=True)
print (df)
ZZONE ZRET_QTY ZREQ_VAL ACTIVE
0 BLR 125 25.66 1
1 BOM 223 29.56 1
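One extra note: DataFrame.sample accepts a random_state parameter, so if you need reproducible draws you can thread a seed through the function. A sketch based on the second idea above (the seed value is arbitrary):

import pandas as pd

def func(x, seed=42):
    # random_state makes each group's draw reproducible across runs
    same = x['ACTIVE'].value_counts().min()
    a = x[x['ACTIVE'] == 0].sample(same, random_state=seed)
    b = x[x['ACTIVE'] == 1].sample(same, random_state=seed)
    return pd.concat([a, b])

df = df.groupby('ZZONE').apply(func).reset_index(drop=True)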

How to extract specific rows from an input text file and print them in python?

I have this text file containing transition lines of FeII emissions. The headers are: n_high, n_low, wavelength, intensity (where n_high and n_low are the upper and lower transition levels). The transitions run
2 --> 1, ..., 371 --> 1, then 3 --> 2, ..., 371 --> 2, and so on until the last chunk, 371 --> 370.
The input file looks like:
#n_hi n_lo WL(A) logI
2 1 259811.86 1.158
3 1 149730.41 -2.054
4 1 115894.98 -2.134
5 1 102320.80 -2.389
6 1 53387.13 0.256
7 1 41138.69 -0.277
8 1 35226.70 -1.585
9 1 32068.36 -1.741
10 1 12566.77 2.323
.
.
.
.
369 1 1069.66 1.461
370 1 1065.75 -7.901
371 1 1065.64 -8.011
3 2 353390.47 0.759
4 2 209224.17 -2.390
5 2 168797.89 -2.607
.
.
.
370 369 291200.84 -10.337
371 369 283465.88 -10.436
371 370 10672868.00 -12.012
There are in total 68635 rows.
The task here is that I'd like to select only those transitions that fall within a wavelength range, say [x1, x2], and print each entire matching row into another file.
So what I have been able to do is sketch an algorithm for it:
for n_low from 1 to 370:
    for n_hi from n_low+1 to 371:
        if x2 <= wavelength <= x1:
            print this row to file
        else:
            exit
I'd like to implement this in Python.
If you want to use standard Python, something like the function below should work (assuming the data is tab-separated):
def filter_wavelength(x1, x2, input_path, output_path):
    with open(output_path, 'w') as output_file:
        with open(input_path) as input_file:
            for line in input_file:
                try:
                    tokens = line.split('\t')
                    wave_length = float(tokens[2])
                    if x1 <= wave_length <= x2:
                        output_file.write(line)
                except Exception as e:
                    print(str(e))
call it like so:
filter_wavelength(1,2,'path/to/input', 'path/to/output')
You can use the powerful pandas.
I use io.StringIO to simulate a file with the data, but you should use a filename instead of f.
data = '''2 1 259811.86 1.158
3 1 149730.41 -2.054
4 1 115894.98 -2.134
5 1 102320.80 -2.389
6 1 53387.13 0.256
7 1 41138.69 -0.277
8 1 35226.70 -1.585
9 1 32068.36 -1.741
10 1 12566.77 2.323
369 1 1069.66 1.461
370 1 1065.75 -7.901
371 1 1065.64 -8.011
3 2 353390.47 0.759
4 2 209224.17 -2.390
5 2 168797.89 -2.607
370 369 291200.84 -10.337
371 369 283465.88 -10.436
371 370 10672868.00 -12.012'''
import pandas as pd
# simulate file
import io
f = io.StringIO(data)
# use filename instead of `f`
# it reads data from file using spaces as separators
# and add headers 'n_hi','n_lo', 'WL(A)', 'logI'
df = pd.read_csv(f, names=['n_hi', 'n_lo', 'WL(A)', 'logI'], sep=r'\s+')
#print(df)
# get rows which have 1000 < WL < 25000
selected = df[ df['WL(A)'].between(1000, 25000) ]
print(selected)
selected.to_csv('result.csv', sep=' ', header=False)
You don't need to care about n_hi and n_lo if your only concern is WL(A); try this:
def extract_wave_lengths(x1, x2, input_file, output_file):
    with open(input_file, 'r') as ifile, open(output_file, 'w') as ofile:
        next(ifile) # Skip header
        for line in ifile:
            parts = line.split()
            wave_length = float(parts[2])
            if x2 <= wave_length <= x1:
                ofile.write(line)
You can then call it this way:
extract_wave_lengths(100000, 5000, "/path/to/input/file", "/path/to/output/file")

Split string and append parts in running list [Python]

I have a very long list that contains the same pattern. Here is an original example:
04:50 10 244.685 0 0
05:00 10 244.680 0 0
HBCHa 9990 Seite 762
16.02.16
Hafenpegel
Pegel müM Var 0 Pegelstand
Datum Zeit Intervall müM Q Art
Sonntag, 31. Dezember 2000 05:10 10 244.679 0 0
05:20 10 244.688 0 0
05:30 10 244.694 0 0
05:40 10 244.688 0 0
As you can see, there is one line where measurement data sits inside the string that starts with "Sonntag".
My target is:
04:50 10 244.685 0 0
05:00 10 244.680 0 0
HBCHa 9990 Seite 762
16.02.16
Hafenpegel
Pegel müM Var 0 Pegelstand
Datum Zeit Intervall müM Q Art
Sonntag, 31. Dezember 2000
05:10 10 244.679 0 0 !!
05:20 10 244.688 0 0
05:30 10 244.694 0 0
05:40 10 244.688 0 0
I managed to read the txt file into a list, here called "data_list_splitted", catch this one line in the whole txt file, split it, and extract the part with the measurements:
for i in data_list_splitted:
    if len(i) >= 40:
        ii = i.split(";")
        txt_line = "%s;%s;%s;%s;%s" % (ii[4], ii[5], ii[6], ii[7], ii[8])
But I can't get it to break this line apart and add the measurement values to the running list!
I think this shouldn't be that difficult?
Any ideas?
Thank you very much!
You can create another list and insert values into it:
new_data_list_splitted = []
for i in data_list_splitted:
    if len(i) >= 40:
        ii = i.split(";")
        txt_line = "%s;%s;%s;%s;%s" % (ii[0], ii[1], ii[2], ii[3], ii[4])
        new_data_list_splitted.append(txt_line)
        txt_line = "%s;%s;%s;%s;%s" % (ii[4], ii[5], ii[6], ii[7], ii[8])
        new_data_list_splitted.append(txt_line)
    else:
        new_data_list_splitted.append(i)
print(new_data_list_splitted)  # this will have a new row for the measurement value
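If your raw lines are actually whitespace-separated rather than semicolon-separated, a hedged alternative is to split once, right before the HH:MM time stamp, with a regular expression (the pattern below assumes the measurement part always starts with a time like 05:10):

import re

new_data_list_splitted = []
for line in data_list_splitted:
    # split once, just before an HH:MM token, e.g.
    # "Sonntag, 31. Dezember 2000 05:10 10 244.679 0 0"
    parts = re.split(r'\s+(?=\d{2}:\d{2}\s)', line, maxsplit=1)
    new_data_list_splitted.extend(parts)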

Group rows in a CSV by blocks of 25

I have a csv file with 2 columns, representing a distribution of items per year which looks like this:
A B
1900 10
1901 2
1903 5
1908 8
1910 25
1925 3
1926 4
1928 1
1950 10
etc, about 15000 lines.
When making a distribution diagram based on this data, there are too many points on the axis, which is not very pretty. I want to group rows into blocks of 25 years, so that at the end I have fewer points on the axis.
So, for example, from 1900 till 1925 I would have the sum of produced items as one row in column A and one row in column B:
1925 53
1950 15
So far I have only figured out how to convert the data in the csv file to int:
import csv

o = open('/dates_dist.csv', 'rU')
mydata = csv.reader(o)
def int_wrapper(mydata):
    for v in mydata:
        yield map(int, v)
reader = int_wrapper(mydata)
Can't find how to do it further...
You could use itertools.groupby:
import itertools as IT
import csv
def int_wrapper(mydata):
    for v in mydata:
        yield [int(x) for x in v]

with open('data', newline='') as o:
    mydata = csv.reader(o)
    header = next(mydata)
    reader = int_wrapper(mydata)
    for key, group in IT.groupby(reader, lambda row: (row[0]-1)//25+1):
        year = key*25
        total = sum(row[1] for row in group)
        print(year, total)
yields
1900 10
1925 43
1950 15
Note that 1900 to 1925 (inclusive) spans 26 years, not 25. So if you want to group 25 years at a time, given the way you are reporting the totals, you probably want the half-open interval (1900, 1925].
The expression row[0]//25 integer-divides the year by 25. That number is the same for all years in the half-open range [1900, 1925). To make the range half-open on the left instead, subtract 1 before dividing and add 1 after: (row[0]-1)//25+1.
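A quick check of that arithmetic at the block boundaries:

key = lambda year: (year - 1) // 25 + 1
print(key(1900), key(1900) * 25)  # 76 1900 -> 1900 closes the (1875, 1900] block
print(key(1901), key(1901) * 25)  # 77 1925 -> 1901 opens the (1900, 1925] block
print(key(1925), key(1925) * 25)  # 77 1925 -> 1925 still belongs to that block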
Here is my approach. It's definitely not the most elegant Python code, but it could be a way to achieve the desired output.
if __name__ == '__main__':
    o = open('dates_dist.csv', 'rU')
    lines = o.read().split("\n")  # Create a list having each line of the file
    out_dict = {}
    curr_date = 0
    curr_count = 0
    chunk_sz = 25  # years
    if len(lines) > 0:
        line_split = lines[0].split(",")
        start_year = int(line_split[0])
        curr_count = 0
    # Iterate over each line of the file
    for line in lines:
        if not line:  # skip empty lines (e.g. a trailing newline)
            continue
        # Split at comma to get the year and the count.
        # line_split[0] will be the year and line_split[1] will be the count.
        line_split = line.split(",")
        curr_year = int(line_split[0])
        time_delta = curr_year - start_year
        if time_delta <= chunk_sz:
            curr_count = curr_count + int(line_split[1])
        else:
            out_dict[start_year + chunk_sz] = curr_count
            start_year = start_year + chunk_sz
            curr_count = int(line_split[1])
        # print(curr_year, curr_count)
    out_dict[start_year + chunk_sz] = curr_count
    print(out_dict)
You could create a dummy column and group by it after doing some integer division:
df['temp'] = df['A'] // 25
>>> df
A B temp
0 1900 10 76
1 1901 2 76
2 1903 5 76
3 1908 8 76
4 1910 25 76
5 1925 3 77
6 1926 4 77
7 1928 1 77
8 1950 10 78
>>> df.groupby('temp').sum()
A B
temp
76 9522 50
77 5779 8
78 1950 10
My numbers are slightly different from yours since I am technically grouping from 1900-1924, 1925-1949, and 1950-1974, but the idea is the same.
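If you would rather label each group with the block's end year, matching the itertools answer's output (1900, 1925, 1950), a hedged variant using the same shifted integer division on this df is:

# label each 25-year half-open block (start, end] by its end year
result = df.groupby((df['A'] - 1) // 25 * 25 + 25)['B'].sum()
print(result)  # expected: 1900 -> 10, 1925 -> 43, 1950 -> 15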
