Extracting parts of array elements using Python

I am working to extract all integer values from a specific column (left, top, width and height) in a CSV file with multiple rows and columns. I have used pandas to isolate the columns I am interested in, but I'm stuck on how to use specific parts of an array.
Let me explain: I need to use the CSV file's column with the "left, top, width and height" attributes to then obtain xmin, ymin, xmax and ymax (these are coordinates of boxes in images). An example of a row in this column looks like this:
[{"left":171,"top":0,"width":163,"height":137,"label":"styrofoam container"},{"left":222,"top":42,"width":45,"height":70,"label":"chopstick"}]
I need to extract the 171, 0, 163 and 137 to do the necessary operations for finding my xmax, xmin, ymax and ymin. The above line is a single row in my pandas array; how do I extract the numbers I need for my operations?
Here is the code I wrote to extract the column and this is what I have so far:
import os
import csv
import pandas
import numpy as np

csvPath = "/path/of/my/csvfile/csvfile.csv"
data = pandas.read_csv(csvPath)
csv_coords = data['Answer.annotation_data'].values  # column with the coordinates
image_name = data['Input.image_url'].values
print(csv_coords[2])

Use:
import ast
import numpy as np
import pandas as pd

d = {'Answer.annotation_data': ['[{"left":171,"top":0,"width":163,"height":137,"label":"styrofoam container"},{"left":222,"top":42,"width":45,"height":70,"label":"chopstick"}]',
                                '[{"left":170,"top":10,"width":173,"height":157,"label":"styrofoam container"},{"left":222,"top":42,"width":45,"height":70,"label":"chopstick"}]']}
df = pd.DataFrame(d)
print(df)
Answer.annotation_data
0 [{"left":171,"top":0,"width":163,"height":137,...
1 [{"left":170,"top":10,"width":173,"height":157...
#convert string data to list of dicts if necessary
df['Answer.annotation_data'] = df['Answer.annotation_data'].apply(ast.literal_eval)
For each value in cols, extract the corresponding values from the dicts and return a DataFrame; finally join them together with concat:
def get_val(val):
    comb = [[y.get(val, np.nan) for y in x] for x in df['Answer.annotation_data']]
    return pd.DataFrame(comb).add_prefix('{}_'.format(val))
cols = ['left','top','width','height']
df1 = pd.concat([get_val(x) for x in cols], axis=1)
print(df1)
left_0 left_1 top_0 top_1 width_0 width_1 height_0 height_1
0 171 222 0 42 163 45 137 70
1 170 222 10 42 173 45 157 70
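The extraction above stops at the raw left/top/width/height values; to get the box coordinates the question actually asks for, xmin and ymin are simply left and top, while xmax and ymax add the width and height. A minimal sketch for the first box in each row, assuming df1 was built as above and that this is the coordinate convention you need:
df1['xmin_0'] = df1['left_0']
df1['ymin_0'] = df1['top_0']
df1['xmax_0'] = df1['left_0'] + df1['width_0']    # right edge = left + width
df1['ymax_0'] = df1['top_0'] + df1['height_0']    # bottom edge = top + height
print(df1[['xmin_0', 'ymin_0', 'xmax_0', 'ymax_0']])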

To access one field in your DataFrame
`data.loc[row][column]` or `data.loc[row,column]`
e.g.
`data.loc[0]['left']`
To find, e.g., the minimum of the top values globally:
min(data['top'])
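Note that data.loc[0]['left'] and min(data['top']) only work once the dicts have been expanded into real columns such as 'left' and 'top'. One possible way to get such flat columns (a sketch, assuming data was loaded as in the question and a pandas version with Series.explode and pd.json_normalize, i.e. 1.0+):
import ast
import pandas as pd

# parse each string into a list of dicts, then flatten to one row per bounding box
data['Answer.annotation_data'] = data['Answer.annotation_data'].apply(ast.literal_eval)
boxes = pd.json_normalize(data['Answer.annotation_data'].explode().tolist())
print(boxes['top'].min())   # minimum of the top values globally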

Related

Create matrix from indices and value points

I want to read a text file with the values of a matrix. Let's say you have a .txt file looking like this:
0 0 4.0
0 1 5.2
0 2 2.1
1 0 2.1
1 1 2.9
1 2 3.1
Here, the first column gives the indices of the matrix on the x-axis and the second column gives the indices on the y-axis. The third column is the value at this position in the matrix. Where values are missing, the value is just zero.
I am well aware that data formats like the .mtx format exist, but I would like to create a scipy sparse matrix or numpy array from this txt file alone instead of adjusting it to the .mtx file format. Is there a Python function out there that does this for me, which I am missing?
import numpy

with open('filename.txt', 'r') as f:
    lines = f.readlines()

data = [i.split(' ') for i in lines]
z = list(zip(*data))
row_indices = list(map(int, z[0]))
column_indices = list(map(int, z[1]))
values = list(map(float, z[2]))

m = max(row_indices) + 1
n = max(column_indices) + 1
p = max([m, n])
A = numpy.zeros((p, p))
A[row_indices, column_indices] = values
print(A)
If, instead of a square matrix, you want the maximum of column 1 as the number of rows and the maximum of column 2 as the number of columns, you can remove p = max([m,n]) and replace A = numpy.zeros((p,p)) with A = numpy.zeros((m,n)).
Starting from the array (a) sorted on the first column (major) and second (minor) as in your example, you can reshape:
import numpy as np

# a = np.loadtxt('filename')
x = len(np.unique(a[:, 0]))
y = len(np.unique(a[:, 1]))
a[:, 2].reshape(x, y).T
Output:
array([[4. , 2.1],
       [5.2, 2.9],
       [2.1, 3.1]])
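Since the question also mentions a scipy sparse matrix: there is no single ready-made reader for this plain three-column format, but scipy.sparse.coo_matrix takes exactly (values, (row_indices, column_indices)), so a short sketch along those lines (assuming scipy is installed and the file looks like the example) would be:
import numpy as np
from scipy.sparse import coo_matrix

# the three columns of the file: row index, column index, value
rows, cols, vals = np.loadtxt('filename.txt', unpack=True)
A_sparse = coo_matrix((vals, (rows.astype(int), cols.astype(int))))
A_dense = A_sparse.toarray()   # same dense array as the numpy.zeros approach above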

How to perform calculations on csv rows and cols and create new cols using numpy and pandas?

I have a csv that contains 12 cols and 4 rows of data (shown as an image in the original post).
I would like to divide each of those values by their area, for which I have created an array (also shown as an image in the original post), and then multiply by 100 to get a percentage, and have these values in a new column.
So for example, A2, A3 and A4 will all be divided by 52,600 and then multiplied by 100.
My current df is shown as an image in the original post.
I interpreted your request for a new column to be a new column for each Sub_* in your spreadsheet, since there were 12 values in your numpy array.
Code edit: I see you wanted to do the math to the 'Baseline' column as well, so I step through each column except the first (which is "Label", at index 0):
import numpy as np
import pandas as pd

df = pd.read_excel(r"d:\stack67477476.xlsx")
area_arr = np.array([[52.6, 14.966, 7.702, 4.169, 3.71, 5.648, 6.785, 1.867, 5.268, 4.989, 1.659, 6.538]])
for ii, col in enumerate(df.columns):
    if ii == 0:  # skip the "Label" column
        continue
    df[col + "_Area"] = round(df[col] / area_arr[0][ii - 1] * 100, 2)
This is vectorized in one dimension (the 4 rows dimension) but loops through the 12 columns dimension. The output is as follows (don't quote me on this, I may have typed your inputs incorrectly):
df
Label Baseline Sub_A Sub_B Sub_C Sub_D Sub_E Sub_F Sub_G Sub_H Sub_I ... Sub_A_Area Sub_B_Area Sub_C_Area Sub_D_Area Sub_E_Area Sub_F_Area Sub_G_Area Sub_H_Area Sub_I_Area Sub_J_Area Sub_K_Area
0 0 0 15535 5128 8847 10784 5679 20481 8398 10012 5162 ... 103801.95 66580.11 212209.16 290673.85 100548.87 301857.04 449812.53 190053.15 103467.63 275527.43 380177.42
1 1 159506 149454 157456 155680 154327 154671 146863 150761 150446 155335 ... 998623.55 2044352.12 3734228.83 4159757.41 2738509.21 2164524.69 8075040.17 2855846.62 3113549.81 9387040.39 1963949.22
2 2 129087 111918 121515 122066 119557 123813 114746 123140 122156 125480 ... 747815.05 1577707.09 2927944.35 3222560.65 2192156.52 1691171.70 6595607.93 2318830.68 2515133.29 7608679.93 1653533.19
3 3 137562 102318 114509 124641 127442 130324 123331 130392 130715 134528 ... 683669.65 1486743.70 2989709.76 3435094.34 2307436.26 1817700.81 6984038.56 2481302.20 2696492.28 8123206.75 1881890.49
4 4 35901 26488 30836 33756 34549 34000 33269 34071 34151 35149 ... 176987.84 400363.54 809690.57 931239.89 601983.00 490331.61 1824906.27 648272.59 704529.97 2146473.78 531691.65
[5 rows x 25 columns]
Note that it's unclear why your numpy array is 2D; one assumes there is something deeper to that in the rest of your code. It would be clearer to avoid the extra set of brackets:
area_arr = np.array([52.6, 14.966, 7.702, 4.169, 3.71, 5.648, 6.785, 1.867, 5.268, 4.989, 1.659, 6.538])
And simplify the divisor to just:
area_arr[ii - 1]  # instead of area_arr[0][ii - 1]
or for that matter, a simple list would be ok, since numpy isn't needed here.
Apologies if we have miscommunicated on commas and decimal points, but the code still works if you change the numbers.
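For completeness, the loop over the twelve columns can also be removed: pandas can divide all the data columns by their areas in one vectorized step. A sketch of that idea, assuming the flat area_arr suggested above, the same column order as the spreadsheet, and that df is the freshly read spreadsheet (before the loop above adds the new columns):
area_arr = [52.6, 14.966, 7.702, 4.169, 3.71, 5.648, 6.785, 1.867, 5.268, 4.989, 1.659, 6.538]
# divide every column after "Label" by its area, convert to a percentage, round, and append
pct = df.iloc[:, 1:].div(area_arr, axis=1).mul(100).round(2).add_suffix("_Area")
df = pd.concat([df, pct], axis=1)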

Is there a faster way to split a pandas dataframe into two complementary parts?

Good evening all,
I have a situation where I need to split a dataframe into two complementary parts based on the value of one feature.
What I mean by this is that for every row in dataframe 1, I need a complementary row in dataframe 2 that takes on the opposite value of that specific feature.
In my source dataframe, the feature I'm referring to is stored under column "773", and it can take on values of either 0.0 or 1.0.
I came up with the following code that does this sufficiently, but it is remarkably slow. It takes about a minute to split 10,000 rows, even on my all-powerful EC2 instance.
data = chunk.iloc[:, 1:776]
listy1 = []
listy2 = []
for i in range(0, len(data)):
    random_row = data.sample(n=1).iloc[0]
    listy1.append(random_row.tolist())
    if random_row["773"] == 0.0:
        x = data[data["773"] == 1.0].sample(n=1).iloc[0]
        listy2.append(x.tolist())
    else:
        x = data[data["773"] == 0.0].sample(n=1).iloc[0]
        listy2.append(x.tolist())
df1 = pd.DataFrame(listy1)
df2 = pd.DataFrame(listy2)
Note: I don't care about duplicate rows, because this data is being used to train a model that compares two objects to tell which one is "better."
Do you have some insight into why this is so slow, or any suggestions as to make this faster?
A key concept in efficient numpy/scipy/pandas coding is to use library-shipped vectorized functions whenever possible: process multiple rows at once instead of iterating explicitly over rows, i.e. avoid for loops and .iterrows().
The implementation below is a little subtle in terms of indexing, but the vectorization idea is straightforward:
Draw the main dataset at once.
The complementary dataset: draw the 0-rows at once, the complementary 1-rows at once, and then put them into the corresponding rows at once.
Code:
import pandas as pd
import numpy as np
from datetime import datetime

np.random.seed(52)  # reproducibility
n = 10000
df = pd.DataFrame(
    data={
        "773": [0, 1] * int(n / 2),
        "dummy1": list(range(n)),
        "dummy2": list(range(0, 10 * n, 10))
    }
)

t0 = datetime.now()
print("Program begins...")

# 1. draw the main dataset
draw_idx = np.random.choice(n, n)  # draw with replacement
df_main = df.iloc[draw_idx, :].reset_index(drop=True)

# 2. draw the complementary dataset
# (1) count number of 1's and 0's
n_1 = np.count_nonzero(df["773"][draw_idx].values)
n_0 = n - n_1
# (2) split data for drawing
df_0 = df[df["773"] == 0].reset_index(drop=True)
df_1 = df[df["773"] == 1].reset_index(drop=True)
# (3) draw n_1 indexes in df_0 and n_0 indexes in df_1
idx_0 = np.random.choice(len(df_0), n_1)
idx_1 = np.random.choice(len(df_1), n_0)
# (4) broadcast the drawn rows into the complementary dataset
df_comp = df_main.copy()
mask_0 = (df_main["773"] == 0).values
df_comp.iloc[mask_0, :] = df_1.iloc[idx_1, :].values   # df_1 rows go where df_main has 0
df_comp.iloc[~mask_0, :] = df_0.iloc[idx_0, :].values  # df_0 rows go where df_main has 1

print(f"Program ends in {(datetime.now() - t0).total_seconds():.3f}s...")
Check
print(df_main.head(5))
773 dummy1 dummy2
0 0 28 280
1 1 11 110
2 1 13 130
3 1 23 230
4 0 86 860
print(df_comp.head(5))
773 dummy1 dummy2
0 1 19 190
1 0 74 740
2 0 28 280 <- this row is complementary to df_main
3 0 60 600
4 1 37 370
Efficiency gain: 14.23 s -> 0.011 s (roughly 1300x)

Create Multiple dataframes from a large text file

Using Python, how do I break a text file into data frames where every 84 rows is a new, different dataframe? The first column x_ft has the same value for 84 rows, then increments by 5 ft for the next 84 rows. I need each identical x_ft value, together with the corresponding values in the other two columns (depth_ft and vel_ft_s), to be in the new dataframe too.
My text file is formatted like this:
x_ft depth_ft vel_ft_s
0 270 3535.755 551.735107
1 270 3534.555 551.735107
2 270 3533.355 551.735107
3 270 3532.155 551.735107
4 270 3530.955 551.735107
.
.
33848 2280 3471.334 1093.897339
33849 2280 3470.134 1102.685547
33850 2280 3468.934 1113.144287
33851 2280 3467.734 1123.937134
I have tried many, many different ways but keep running into errors and would really appreciate some help.
I suggest looking into pandas.read_table, which automatically outputs a DataFrame. Once you have done so, you can isolate the rows of the DataFrame that you want to separate (every 84 rows) by doing something like this:
import pandas as pd

# read the txt data table with pandas (whitespace-delimited)
df = pd.read_table("your_file.txt", sep=r"\s+")

# this gives you a list of all x_ft values in your dataset (270, 275, ..., 2280)
arr = []
for x in range(0, 403):
    val = 270 + 5 * x
    arr.append(val)

# this writes a csv file for every group of rows with a specific x_ft value,
# keeping its corresponding columns (depth_ft and vel_ft_s)
for x_value in arr:
    tempdf = df[df['x_ft'] == x_value]
    tempdf.to_csv("df" + str(x_value) + ".csv")
You can get indexes to split your data:
rows = 84
datasets = round(len(data)/rows) # total datasets
index_list = []
for index in data.index:
x = index % rows
if x == 0:
index_list.append(index)
print(index_list)
Then split the original dataset by these indexes:
l_mod = index_list + [len(data)]  # append the end index so the last chunk is complete
dfs_list = [data.iloc[l_mod[n]:l_mod[n + 1]] for n in range(len(l_mod) - 1)]
print(len(dfs_list))
Outputs
print(type(dfs_list[1]))
# pandas.core.frame.DataFrame
print(len(dfs_list[0]))
# 84
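Since the grouping key is simply the x_ft value, DataFrame.groupby can also do the whole split in one step. A short sketch of that alternative, assuming the file was read into df with pandas.read_table as suggested in the first answer (the file name here is a placeholder):
import pandas as pd

df = pd.read_table("your_file.txt", sep=r"\s+")
# one sub-DataFrame per distinct x_ft value (i.e. per block of 84 rows)
dfs_by_x = {x: g.reset_index(drop=True) for x, g in df.groupby("x_ft")}
print(len(dfs_by_x))          # number of blocks
print(dfs_by_x[270].head())   # the block where x_ft == 270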

Adding values from a CSV file

I am beginning to learn Python and am struggling with the syntax.
I have a simple CSV file that looks like this
0.01,10,20,0.35,40,50,60,70,80,90,100
2,22,32,42,52,62,72,82,92,102,112
3,33,43,53,63,5647,83,93,103,113,123
I want to look for the highest and lowest value in all the data in the CSV file, except the first value of each row.
So effectively the answer here would be
highestValue=5647
lowestValue=0.35
because the data that is looked at is as follows (ignoring the first value of each row):
10,20,0.35,40,50,60,70,80,90,100
22,32,42,52,62,72,82,92,102,112
33,43,53,63,5647,83,93,103,113,123
I would like my code to work for ANY row length.
I have to admit I'm struggling, but here's what I've tried. I usually program in PHP, so this is all new to me. I have been working on this simple task for a day and can't fathom it out. I think I'm getting confused with terminology, 'lists' for example.
import numpy

test_data_file = open("Anaconda3JamesData/james_test_3.csv", "r")
test_data_list = test_data_file.readlines()
test_data_file.close()
for record in test_data_list:
    all_values = record.split(',')
    maxvalue = numpy.max(numpy.asfarray(all_values[1:]))
    print(maxvalue)
With the test data (the CSV file shown at the very top of this question) I would expect the answer to be
highestValue=5647
lowestValue=0.35
If you're using numpy, you can read your csv file as a numpy.ndarray using numpy.genfromtxt() and then use the array's .max() and .min() methods
import numpy
array = numpy.genfromtxt('Anaconda3JamesData/james_test_3.csv', delimiter=',')
array[:, 1:].max()
array[:, 1:].min()
The [:, 1:] part is using numpy's array indexing. It says: take all the rows (the first [:, part), and for each row take all but the first column (the 1:] part). This doesn't work with Python's built-in lists.
You're overwriting maxvalue each time through the loop, so you're just getting the max value from the last line, not the whole file. You need to compare with the previous maximum.
maxvalue = None
for record in test_data_list:
    all_values = record.split(',')
    if maxvalue is None:
        maxvalue = numpy.max(numpy.asfarray(all_values[1:]))
    else:
        maxvalue = max(maxvalue, numpy.max(numpy.asfarray(all_values[1:])))
You do not need the power of numpy for this problem. A simple CSV reader is good enough:
with open("Anaconda3JamesData/james_test_3.csv") as infile:
r = csv.reader(infile)
rows = [list(map(float, line))[1:] for line in r]
max(map(max, rows))
# 5647.0
min(map(min, rows))
# 0.35
I think using numpy is unneeded for this task. First of all, this:
test_data_file = open ("Anaconda3JamesData/james_test_3.csv","r")
test_data_list = test_data_file.readlines()
test_data_file.close()
for record in test_data_list:
can be simplified into this:
with open("Anaconda3JamesData/james_test_3.csv","r") as test_data_file:
for record in test_data_file:
We can use a list comprehension to read in all of the values:
with open("Anaconda3JamesData/james_test_3.csv","r") as test_data_file:
values = [float(val) for line in test_data_file for val in line.split(",")[1:]]
values now contains all relevant numbers, so we can just do:
highest_value = max(values)
lowest_value = min(values)
Here's a pandas solution that can give the desired results:
import pandas as pd
df = pd.read_csv('test1.csv', header=None)
# df:
# 0 1 2 3 4 5 6 7 8 9 10
# 0 0.01 10 20 0.35 40 50 60 70 80 90 100
# 1 2.00 22 32 42.00 52 62 72 82 92 102 112
# 2 3.00 33 43 53.00 63 5647 83 93 103 113 123
df = df.iloc[:, 1:]
print("Highest value: {}".format(df.values.max()))
print("Lowest value: {}".format(df.values.min()))
#Output:
Highest value: 5647.0
Lowest value: 0.35
