Find an efficient way of searching in nested python lists - python

I am very new to this forum and am basically a Network Engineer learning Python to automate some tasks and make my work more efficient. Well, straight to the point: I have a big Excel workbook with 4 sheets and around 50K rows in each sheet. After learning for a couple of weeks and extensive searching, I was able to load all the Excel cell values into a nested list, e.g.
list [sheet_index][row_index][column_index].
Now that the inputs are loaded, the next part is manipulating that data. My task is to take a specific column value from each row, search for it across the entire workbook, and, if found, write the corresponding data from a different column in line with the originally searched value.
My method is as follows:
Get the cell values into one big list (as mentioned earlier).
Flatten that list into a separate one-dimensional list.
In a loop, get the specific value from each row (fixed column) and search the entire one-dimensional list; if found, write the corresponding value to a different Excel file.
So far this method works correctly, but with an extremely long delay, which was the motivation for moving away from Excel VBA to Python in the first place. So I am here to ask the experts whether there is something very basic I am missing. Here is the code:
import xlrd
import xlwt
from xlutils.copy import copy      # needed for copy(wb) below (missing in the original)
from compiler.ast import flatten   # Python 2 only

datafile = 'Peering_DB.xls'

# Data Read Function Definition
def main(datafile):
    wb = xlrd.open_workbook(datafile)
    wwb = copy(wb)
    # nested list: data[sheet_index][row_index][column_index]
    data = [[[wb.sheet_by_index(i).cell_value(r, col)
              for col in range(wb.sheet_by_index(i).ncols)]
             for r in range(wb.sheet_by_index(i).nrows)]
            for i in range(0, 4)]
    data1 = flatten(data)
    k = 2
    x = 0
    while x < 4:
        r = wb.sheet_by_index(x).nrows
        A = data[x][k][1]
        B = data[x][k][2]
        counter = 4
        # linear scan of the whole flattened workbook for every row
        loc = [loc for (loc, e) in enumerate(data1) if e == A]
        if len(loc) != 1:
            for n in range(len(loc)):
                if data1[loc[n] + 1] != B:
                    wwb.get_sheet(x).write(k, counter, data1[loc[n] + 1])
                    counter = counter + 1
        else:
            wwb.get_sheet(x).write(k, counter, "No Backup")
        k = k + 1
        if k == r - 1 and x < 3:
            print 'Page number ', x, 'Completed'
            x = x + 1
            k = 2
        elif k == r and x == 3:
            print "Operation Completed Successfully"
            break
    wwb.save('Peering_output.xls')

main(datafile)
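
For scale: the list comprehension that builds loc rescans the entire flattened workbook (4 sheets × ~50K rows × all columns) once per processed row, which makes the whole run quadratic and explains the delay. A hedged sketch of the usual remedy, built on the same data1 layout as above: index the flattened list into a dict once, so each per-row search becomes a single O(1) lookup.

import collections

# build the index once: each value maps to the list of cell values that
# immediately follow it in data1 (mirrors the data1[loc[n] + 1] reads above)
index = collections.defaultdict(list)
for pos in range(len(data1) - 1):
    index[data1[pos]].append(data1[pos + 1])

# inside the row loop, replace the full scan with a dict lookup
followers = index.get(A, [])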

Related

Searching through strings in a dataframe and increasing the numbers found by 1

I have a dataframe that I created by hand. I am working on code that copies the dataframe and concatenates the new dataframe to the end of the first one. For now, I need the code to look through each value of the 'Name' column, which contains strings, and if there is a number in a string, increase that number by 1. I need the number to be turned into an int so that I can write a function that looks through the dataframe and automatically adds 1 to the largest number found. An example:
import re   # re is used below but was not imported in the original
import pandas as pd

data = {'ID': [1, 2, 3, 4],
        'Name': ['BN #1', 'HHC', 'A comp', 'B Comp']}
df = pd.DataFrame(data)
df['SysNum'] = [int(re.search('(?<=#)\d', x)[0]) for x in df['Name'].values]
Afterwards the new df looks like
data2 = {'ID': [1, 2, 3, 4, 5, 6, 7, 8],
         'Name': ['BN #1', 'HHC', 'A comp', 'B Comp', 'BN #2', 'HHC', 'A comp', 'B Comp']}
When I run this, I receive a 'NoneType' object is not subscriptable error. This makes sense, because only the 'BN #1' row has a number and re.search returns None when the pattern does not match, but I cannot figure out how to tell Python to ignore the other rows.
EDIT
Only the first row of each dataframe will increase by 1, so if there is an easier way that does not use re.search, that is fine. I know there are a couple of ways of doing this, but I want to always be able to look through the string value of BN and increase it by 1 every time I run the code.
REGEX EDIT
df2['BaseName'] = [re.sub('\d', '', x) for x in df2['Name'].values]
df['BaseName'] = [re.sub('\d', '', x) for x in df['Name'].values]
df2['SysNum'] = [int(re.search('(?<=#)\d', x)[0]) for x in df2['Name'].values]
# df2['SysNum'] = df2['Name'].get(r'(?<=#)\d').astype(int)
# df['SysNum'] = [int(re.search('(?<=#)\d', x)[0]) for x in df['Name'].values]
df['SysNum'] = df['Name'].str.contains('(?<=#)\d').astype(int)
m = re.search(r'(?<=#)\d', df2['Name'].iloc[0])
if m:
    df2['SysNum'] = int(m.group(0)) + 1
n = re.search(r'(?<=#)\d', df['Name'].iloc[0])
if n:
    df['SysNum'] = int(n.group(1)) + 1
new_names = df2['BaseName'].unique()
maxes2 = np.zeros((len(new_names), ))
for j in range(len(new_names)):
    un2 = new_names[j]
    maxes2[j] = df['SysNum'].loc[df['BaseName'] == un2].max()
    df2['SysNum'].loc[df2['BaseName'] == un2] = np.linspace(1, len(df2['SysNum'].loc[df2['BaseName'] == un2]), len(df2['SysNum'].loc[df2['BaseName'] == un2]))
    df2['SysNum'].loc[df2['BaseName'] == un2] += maxes2[j]
    newnames2 = [s + '%d' % num for s, num in zip(df2['BaseName'].loc[df2['BaseName'] == un2].values, df2['SysNum'].loc[df2['BaseName'] == un2].values)]
    df2['Name'].loc[df2['BaseName'] == un2] = newnames2
I have this code working for two dataframes, and the numbering works out how I would like it to. Those first two use a "Name-###" naming convention for every row, which lets the commented-out re.search line at the top run just fine. The next two dataframes I am working on are like the examples above: "BN #1" has a number and the rest of the names do not. When I run the commented-out re.search lines, the code tries to convert the NoneTypes to int and fails. When I run the code as it is now, a new number is put on each and every row immediately following the name, but I need it to add a new number only to the row with the #. So what I am struggling with is code that looks through the dataframe, finds a # sign, turns the number after the # sign into an int, loops to find the max int and adds 1 to it, puts that new number onto the new dataframe, and appends the new dataframe onto the old one to build a larger master list.
You can access the value on the first row of the Name column using df['Name'].iloc[0].
Thus, you can search for a sequence of digits after a # sign in that value using
m = re.search(r'#(\d+)', df['Name'].iloc[0])
if m:
    df['SysNum'] = int(m.group(1)) + 1
Output:
>>> df
   ID    Name  SysNum
0   1   BN #1       2
1   2     HHC       2
2   3  A comp       2
3   4  B Comp       2
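
Extending that idea to the broader goal from the edit (find the largest number after a '#' anywhere in the column, then add 1), a hedged sketch; extract_num is a hypothetical helper, not part of the answer above:

import re

def extract_num(name):
    # return the integer after '#', or None when the name has no number
    m = re.search(r'#(\d+)', name)
    return int(m.group(1)) if m else None

nums = df['Name'].map(extract_num).dropna()
next_num = int(nums.max()) + 1 if not nums.empty else 1   # e.g. 2 when the only match is 'BN #1'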

In Python, how can I use a loop to name pandas data frames?

What I'm trying to do is use pandas to create as many separate data arrays as there are runs in my data set. The approach needs to vary depending on the data file read in, so I want the run number (the second column) to be used to identify the data and split it into separate data sets.
So I have a data set that looks like:
1.350000035018e-03 1.000000000000e+00 -1.617387196395e-14
2.850000048056e-03 1.000000000000e+00 -2.752685546875e-06
4.350000061095e-03 1.000000000000e+00 -2.062988281250e-06
(couple hundred lines later)
1.350000035018e-03 2.000000000000e+00 -1.617387196395e-14
2.850000048056e-03 2.000000000000e+00 -2.752685546875e-06
4.350000061095e-03 2.000000000000e+00 -2.062988281250e-06
(however many readings later)
1.350000035018e-03 35.000000000000e+00 -1.617387196395e-14
2.850000048056e-03 35.000000000000e+00 -2.752685546875e-06
4.350000061095e-03 35.000000000000e+00 -2.062988281250e-06
I want to process it into:
data1 = some number 1.0 some number
some number 1.0 some number
data2 = some number 2.0 some number
some number 2.0 some number
datan= some number n some number
some number n some number
So far my code:
f = r'C:~.dat'
# store data using pandas
data = pd.read_csv(f, sep='\t', comment='#', names=['V', 'n', 'I'])
# observe data format
print(data)
          V    n             I
0  0.001350  1.0 -1.617387e-14
1  0.002850  1.0 -2.752686e-06
2  0.004350  1.0 -2.062988e-06
# count the loops for automated graph plotting
num = 1
for i in range(len(data)):
    if i > 0:
        if data['n'][i] > data['n'][i - 1]:
            num = num + 1
print('there are ' + str(num) + ' runs')
# separate data based on loop #n
for i in range(num):
    run = data.groupby(data.n)
    data+str(i) = run.get_group(i)   # invalid: can't assign to an expression
    print(data+str(i))
Using the data grouping method works, but I can't figure out a way to use the loop number as a variable name. Any help/suggestions would be highly appreciated.
Do you need to explicitly name your dataframes or can it be part of a list or dict?
For instance, you could do something like this...
import pandas as pd

f = r'C:~.dat'
# store data using pandas
data = pd.read_csv(f, sep='\t', comment='#', names=['V', 'n', 'I'])
data_list = []
# get unique run entries
runs = data["n"].unique()
# save each run's corresponding dataframe into data_list
for run in runs:
    data_sub = data[data["n"] == run]
    data_list.append(data_sub)
# access it by doing something as follows
for idx, run in enumerate(runs):
    print("Working on run {}".format(run))
    df_to_operate_on = data_list[idx]
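
If a dict is acceptable, pandas' groupby can produce the same split more directly; a minimal sketch along the same lines:

# one dataframe per run, keyed by the run number itself
runs_dict = {run: grp for run, grp in data.groupby('n')}
df_run2 = runs_dict[2.0]   # look up run 2 by its run number instead of a positional index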
I'm not entirely sure I understand correctly what you're trying to achieve. But if you aim to have data like this:
1.350000035018e-03 1 -1.617387196395e-14
2.850000048056e-03 2 -2.752685546875e-06
4.350000061095e-03 3 -2.062988281250e-06
do you really need the n column?
Isn't that just the data.index + 1?
(the index in your example is [0, 1, 2], and you're looking for [1, 2, 3], so you might be able to do something like data.n = [i + 1 for i in data.index])

Script keeps showing "SettingWithCopyWarning"

Hello, my problem is that my script keeps showing the message below:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
downcast=downcast
I searched Google for a while regarding this, and it seems like my code is somehow assigning a sliced dataframe to a new variable, which is problematic. The problem is **I can't find where my code gets problematic**. I tried the copy function and separated the nested functions, but it is not working.
I attached my code below.
def case_sorting(file_get, col_get, methods_get, operator_get, value_get):
    ops = {">": gt, "<": lt}
    col_get = str(col_get)
    value_get = int(value_get)
    if methods_get is "|x|":   # note: 'is' compares identity; '==' is the usual string comparison
        new_file = file_get[ops[operator_get](file_get[col_get], value_get)]
    else:
        new_file = file_get[ops[operator_get](file_get[col_get], np.percentile(file_get[col_get], value_get))]
    return new_file
Basically what i was about to do was to make flask api that gets excel file as an input, and returns the csv file with some filtering. So I defined some functions first.
def get_brandlist(df_input, brand_input):
    if brand_input == "default":
        final_list = (pd.unique(df_input["브랜드"])).tolist()
    else:
        final_list = brand_input.split("/")
        if '브랜드' in final_list:
            final_list.remove('브랜드')
    final_list = [x for x in final_list if str(x) != 'nan']
    return final_list
Then I defined the main function
def select_bestitem(df_data, brand_name, col_name, methods, operator, value):
    # // 2-1 // remove unnecessary rows and columns with na values
    df_data = df_data.dropna(axis=0 & 1, how='all')   # note: 0 & 1 is bitwise AND, i.e. axis=0
    df_data.fillna(method='pad', inplace=True)
    # // 2-2 // iterate over all rows to find which row contains the brand value
    default_number = 0
    for row in df_data.itertuples():
        if '브랜드' in row:
            df_data.columns = df_data.iloc[default_number, :]
            break
        else:
            default_number = default_number + 1
    # // 2-3 // create the list containing all the target brand names
    brand_list = get_brandlist(df_input=df_data, brand_input=brand_name)
    # // 2-4 // subset the target brands into another dataframe
    df_data_refined = df_data[df_data.iloc[:, 1].isin(brand_list)]
    # // 2-5 // split the dataframe based on the brand name, and apply the input condition
    df_per_brand = {}
    df_per_brand_modified = {}
    for brand_each in brand_list:
        df_per_brand[brand_each] = df_data_refined[df_data_refined['브랜드'] == brand_each]
        file = df_per_brand[brand_each].copy()
        df_per_brand_modified[brand_each] = case_sorting(file_get=file, col_get=col_name, methods_get=methods,
                                                         operator_get=operator, value_get=value)
    # // 2-6 // merge all the remaining dataframes
    df_merged = pd.DataFrame()
    for brand_each in brand_list:
        df_merged = df_merged.append(df_per_brand_modified[brand_each], ignore_index=True)
    final_df = df_merged.to_csv(index=False, sep=',', encoding='utf-8')
    return final_df
And I am going to import this function into my app.py later.
I am quite new to coding, so I am really sorry if my code is hard to understand, but I just really want to get rid of this annoying warning message. Thanks for the help in advance :)
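
For reference, a minimal reproduction of the warning and the usual .copy() fix, with purely illustrative names (a sketch of the general pattern rather than a pinpoint of the offending line above):

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

sub = df[df['a'] > 1]          # a slice: pandas cannot tell whether it is a view or a copy
sub['b'] = 0                   # triggers SettingWithCopyWarning

sub = df[df['a'] > 1].copy()   # an explicit copy makes the intent unambiguous
sub['b'] = 0                   # no warning: this clearly modifies the copy only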

counting entries yields a wrong dataframe

So I'm trying to automate the process of getting the number of entries a person has by using pandas.
Here's my code:
st = pd.read_csv('list.csv', na_values=['-'])
auto = pd.read_csv('data.csv', na_values=['-'])
comp = st.Component.unique()
eventname = st.EventName.unique()

def get_summary(ID):
    for com in comp:
        for event in eventname:
            arr = []
            for ids in ID:
                x = len(st.loc[(st.User == str(ids)) & (st.Component == str(com)) & (st.EventName == str(event))])
                arr.append(x)
            auto.loc[:, event] = pd.Series(arr, index=auto.index)
The output I get (screenshot omitted) ends up wrong. I ran some manual loops to check the entries for the first four columns, and I counted them manually in the CSV file too. When I put a print statement inside the loop, I can see that it does count the entries correctly, but at some point they get overwritten with zero values.
What am I missing/doing wrong here?
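
One likely cause, offered as a guess: auto.loc[:, event] is assigned once per component, so whatever the last com in the outer loop produces (possibly all zeros) overwrites the counts earlier components wrote to the same event column. A hedged sketch that keys each column by the (component, event) pair instead:

def get_summary(ID):
    for com in comp:
        for event in eventname:
            arr = [len(st.loc[(st.User == str(ids)) &
                              (st.Component == str(com)) &
                              (st.EventName == str(event))])
                   for ids in ID]
            # one column per (component, event) pair, so a later component
            # no longer overwrites an earlier component's counts
            auto.loc[:, '{}_{}'.format(com, event)] = pd.Series(arr, index=auto.index)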

Google chart input data

I have a Python script that builds inputs for a Google chart. It correctly creates the column headers and the correct number of rows, but repeats the data from the last row in every row. I tried explicitly setting the row indices rather than using a loop (which wouldn't work in practice, but should have worked in testing); it still gives me the same values for each entry. I also had it working when this code was on the same page as the HTML user form.
end1 = number of rows in the data table
end2 = number of columns in the data table represented by a list of column headers
viewData = data stored in database
c = connections['default'].cursor()
c.execute("SELECT * FROM {0}.\"{1}\"".format(analysis_schema, viewName))
viewData=c.fetchall()
curDesc = c.description
end1 = len(viewData)
end2 = len(curDesc)
Creates column headers:
colOrder = [curDesc[2][0]]
if activityOrCommodity == "activity":
    tableDescription = {curDesc[2][0]: ("string", "Activity")}
elif (activityOrCommodity == "commodity") or (activityOrCommodity == "aa_commodity"):
    tableDescription = {curDesc[2][0]: ("string", "Commodity")}
for i in range(3, end2):
    attValue = curDesc[i][0]
    tableDescription[curDesc[i][0]] = ("number", attValue)
    colOrder.append(curDesc[i][0])
Creates row data:
data = []
values = {}
for i in range(0, end1):
    for j in range(2, end2):
        if j == 2:
            values[curDesc[j][0]] = viewData[i][j].encode("utf-8")
        else:
            values[curDesc[j][0]] = viewData[i][j]
    data.append(values)
dataTable = gviz_api.DataTable(tableDescription)
dataTable.LoadData(data)
return dataTable.ToJSon(columns_order=colOrder)
An example javascript output:
var dt = new google.visualization.DataTable({cols:[{id:'activity',label:'Activity',type:'string'},{id:'size',label:'size',type:'number'},{id:'compositeutility',label:'compositeutility',type:'number'}],rows:[{c:[{v:'AA26FedGovAccounts'},{v:49118957568.0},{v:1.94956132673}]},{c:[{v:'AA26FedGovAccounts'},{v:49118957568.0},{v:1.94956132673}]},{c:[{v:'AA26FedGovAccounts'},{v:49118957568.0},{v:1.94956132673}]},{c:[{v:'AA26FedGovAccounts'},{v:49118957568.0},{v:1.94956132673}]},{c:[{v:'AA26FedGovAccounts'},{v:49118957568.0},{v:1.94956132673}]}]}, 0.6);
It seems you're appending values to data, but values is not being reset after each iteration...
I assume this is not intended, right? If so, just move values inside the first for loop in your row-building code.
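
In other words, a single values dict is shared by every row: each data.append(values) appends a reference to the same object, so the final iteration's contents show up in every row. A sketch of the corrected row-building section, with the dict created inside the row loop:

data = []
for i in range(0, end1):
    values = {}                  # fresh dict per row, so rows no longer alias one another
    for j in range(2, end2):
        if j == 2:
            values[curDesc[j][0]] = viewData[i][j].encode("utf-8")
        else:
            values[curDesc[j][0]] = viewData[i][j]
    data.append(values)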
