Create Multiple dataframes from a large text file

Using Python, how do I break a text file into data frames where every 84 rows is a new, different dataframe? The first column x_ft is the same value every 84 rows then increments up by 5 ft for the next 84 rows. I need each identical x_ft value and corresponding values in the row for the other two columns (depth_ft and vel_ft_s) to be in the new dataframe too.
My text file is formatted like this:
x_ft depth_ft vel_ft_s
0 270 3535.755 551.735107
1 270 3534.555 551.735107
2 270 3533.355 551.735107
3 270 3532.155 551.735107
4 270 3530.955 551.735107
.
.
33848 2280 3471.334 1093.897339
33849 2280 3470.134 1102.685547
33850 2280 3468.934 1113.144287
33851 2280 3467.734 1123.937134
I have tried many, many different ways but keep running into errors and would really appreciate some help.

I suggest looking into pandas.read_table, which returns a DataFrame. Once the file is loaded, you can isolate the rows that belong together (each block of 84 rows shares one x_ft value) by doing something like this:
import pandas as pd
# Read the whitespace-delimited text file ("data.txt" is a placeholder name)
df = pd.read_table("data.txt", sep=r"\s+")
arr = []
# This gives you a list of all x_ft values in your dataset (270, 275, ..., 2280)
for x in range(0, 403):
    val = 270 + 5 * x
    arr.append(val)
# This writes one csv file per x_ft value, with its corresponding columns (depth_ft and vel_ft_s)
for x_value in arr:
    tempdf = df[df['x_ft'] == x_value]
    tempdf.to_csv("df" + str(x_value) + ".csv")

You can get the indexes at which to split your data:
rows = 84
datasets = round(len(data) / rows)  # total datasets
index_list = []
for index in data.index:
    x = index % rows
    if x == 0:
        index_list.append(index)
print(index_list)
Then split the original dataset at those indexes:
l_mod = index_list + [len(data)]  # closing boundary, so the last block keeps all of its rows
dfs_list = [data.iloc[l_mod[n]:l_mod[n+1]] for n in range(len(l_mod)-1)]
print(len(dfs_list))
Outputs:
print(type(dfs_list[1]))
# pandas.core.frame.DataFrame
print(len(dfs_list[0]))
# 84
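Because x_ft is constant within each 84-row block, the same split can also be sketched with groupby or with fixed-size slices. The frame below is synthetic stand-in data (three blocks rather than the full file), and the names dfs_by_x and dfs_by_position are illustrative only:

```python
import numpy as np
import pandas as pd

rows = 84  # rows per block, as stated in the question

# Synthetic stand-in for the real file: three x_ft blocks of 84 rows each
data = pd.DataFrame({
    "x_ft": np.repeat([270, 275, 280], rows),
    "depth_ft": np.tile(np.arange(rows, dtype=float), 3),
    "vel_ft_s": np.zeros(3 * rows),
})

# Option 1: one DataFrame per distinct x_ft value, keyed by that value
dfs_by_x = {x: group.reset_index(drop=True) for x, group in data.groupby("x_ft")}

# Option 2: fixed-size slices by position, independent of the index labels
dfs_by_position = [data.iloc[start:start + rows] for start in range(0, len(data), rows)]

print(len(dfs_by_x), len(dfs_by_position))
```

The groupby variant does not depend on every block having exactly 84 rows, while the slicing variant does not depend on the x_ft values at all.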

Related

Sum and merge rows in a data frame

I am trying to sum duplicate rows in the amount column, as shown in the screenshot:
So if report_name, line_item and column_item are the same, I want to sum the values in the amount column and create one row instead of two, but without losing the structure of the dataframe.
But I don't want to sum duplicates if they have column_item 50 or 30.
This is my data frame:
entity;business_line_group;conso_level_entity;report_name;line_item;column_item;z_axis;value_text;amount;approval_text
456;test;456;C_72_00_a;0050;0010;UNDEFINED;n/a;40409261.0100539;22/03/2022
456;test;456;C_74_00_a;0040;0010;UNDEFINED;n/a;46860662.1948734;22/03/2022
456;test;456;C_74_00_a;0060;0010;UNDEFINED;n/a;1783648.53838003;22/03/2022
456;test;456;C_74_00_a;0070;0010;UNDEFINED;n/a;7847645.76582712;22/03/2022
456;test;456;C_73_00_a;0310;0010;UNDEFINED;n/a;48100909.2077918;22/03/2022
456;test;456;C_74_00_a;0201;0010;UNDEFINED;n/a;45652287.0078367;22/03/2022
456;test;456;C_72_00_a;0590;0010;UNDEFINED;n/a;19988230.281333;22/03/2022
456;test;456;C_73_00_a;0480;0010;UNDEFINED;n/a;28243908.6235795;22/03/2022
456;test;456;C_73_00_a;0490;0010;UNDEFINED;n/a;12655653.8647408;22/03/2022
456;test;456;C_73_00_a;0530;0010;UNDEFINED;n/a;27792100.4510517;22/03/2022
456;test;456;C_73_00_a;0570;0010;UNDEFINED;n/a;20768476.5051213;22/03/2022
456;test;456;C_73_00_a;0480;0010;UNDEFINED;n/a;28601515.4535418;22/03/2022
456;test;456;C_73_00_a;0490;0010;UNDEFINED;n/a;17269663.9202129;22/03/2022
456;test;456;C_73_00_a;0530;0010;UNDEFINED;n/a;21250486.2477187;22/03/2022
456;test;456;C_73_00_a;0570;0010;UNDEFINED;n/a;12924566.8399212;22/03/2022
456;test;456;C_73_00_a;0110;0010;UNDEFINED;n/a;17299383.641137;22/03/2022
456;test;456;C_73_00_a;0035;0010;UNDEFINED;n/a;19054145.8837998;22/03/2022
456;test;456;C_72_00_a;0280;0010;UNDEFINED;n/a;294348.91379545;22/03/2022
456;test;456;C_73_00_a;0340;0010;UNDEFINED;n/a;40803729.9712868;22/03/2022
456;test;456;C_74_00_a;0240;0010;UNDEFINED;n/a;25387904.3875074;22/03/2022
456;test;456;C_73_00_a;0340;0010;UNDEFINED;n/a;6951075.43742419;22/03/2022
456;test;456;C_74_00_a;0240;0010;UNDEFINED;n/a;12298844.1430509;22/03/2022
456;test;456;C_72_00_a;0040;0030;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_72_00_a;0050;0030;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_72_00_a;0060;0030;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_72_00_a;0070;0030;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_72_00_a;0090;0030;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_72_00_a;0110;0030;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_72_00_a;0240;0030;UNDEFINED;n/a;0.85;22/03/2022
456;test;456;C_72_00_a;0260;0030;UNDEFINED;n/a;0.85;22/03/2022
456;test;456;C_72_00_a;0080;0030;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_72_00_a;0100;0030;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_72_00_a;0120;0030;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_72_00_a;0130;0030;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_72_00_a;0140;0030;UNDEFINED;n/a;0.95;22/03/2022
456;test;456;C_72_00_a;0150;0030;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_72_00_a;0170;0030;UNDEFINED;n/a;0.8;22/03/2022
456;test;456;C_72_00_a;0190;0030;UNDEFINED;n/a;0.93;22/03/2022
456;test;456;C_72_00_a;0200;0030;UNDEFINED;n/a;0.88;22/03/2022
456;test;456;C_72_00_a;0250;0030;UNDEFINED;n/a;0.85;22/03/2022
456;test;456;C_72_00_a;0270;0030;UNDEFINED;n/a;0.85;22/03/2022
456;test;456;C_72_00_a;0280;0030;UNDEFINED;n/a;0.85;22/03/2022
456;test;456;C_72_00_a;0290;0030;UNDEFINED;n/a;0.8;22/03/2022
456;test;456;C_72_00_a;0320;0030;UNDEFINED;n/a;0.75;22/03/2022
456;test;456;C_72_00_a;0330;0030;UNDEFINED;n/a;0.75;22/03/2022
456;test;456;C_72_00_a;0340;0030;UNDEFINED;n/a;0.7;22/03/2022
456;test;456;C_72_00_a;0350;0030;UNDEFINED;n/a;0.65;22/03/2022
456;test;456;C_72_00_a;0360;0030;UNDEFINED;n/a;0.5;22/03/2022
456;test;456;C_72_00_a;0370;0030;UNDEFINED;n/a;0.5;22/03/2022
456;test;456;C_72_00_a;0380;0030;UNDEFINED;n/a;0.5;22/03/2022
456;test;456;C_72_00_a;0390;0030;UNDEFINED;n/a;0.5;22/03/2022
456;test;456;C_72_00_a;0400;0030;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_72_00_a;0410;0030;UNDEFINED;n/a;0.7;22/03/2022
456;test;456;C_72_00_a;0420;0030;UNDEFINED;n/a;0.65;22/03/2022
456;test;456;C_72_00_a;0430;0030;UNDEFINED;n/a;0.6;22/03/2022
456;test;456;C_72_00_a;0440;0030;UNDEFINED;n/a;0.45;22/03/2022
456;test;456;C_72_00_a;0450;0030;UNDEFINED;n/a;0.75;22/03/2022
456;test;456;C_72_00_a;0460;0030;UNDEFINED;n/a;0.75;22/03/2022
456;test;456;C_73_00_a;0040;0050;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_73_00_a;0070;0050;UNDEFINED;n/a;0.15;22/03/2022
456;test;456;C_73_00_a;0090;0050;UNDEFINED;n/a;0.03;22/03/2022
456;test;456;C_73_00_a;0110;0050;UNDEFINED;n/a;0.1;22/03/2022
456;test;456;C_73_00_a;0260;0050;UNDEFINED;n/a;0.4;22/03/2022
456;test;456;C_73_00_a;0310;0050;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_73_00_a;0480;0050;UNDEFINED;n/a;0.05;22/03/2022
456;test;456;C_73_00_a;0490;0050;UNDEFINED;n/a;0.1;22/03/2022
456;test;456;C_73_00_a;0530;0050;UNDEFINED;n/a;0.4;22/03/2022
456;test;456;C_73_00_a;0570;0050;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_73_00_a;0590;0050;UNDEFINED;n/a;0.05;22/03/2022
456;test;456;C_73_00_a;0080;0050;UNDEFINED;n/a;0.05;22/03/2022
456;test;456;C_73_00_a;0140;0050;UNDEFINED;n/a;0.05;22/03/2022
456;test;456;C_73_00_a;0150;0050;UNDEFINED;n/a;0.25;22/03/2022
456;test;456;C_73_00_a;0170;0050;UNDEFINED;n/a;0.25;22/03/2022
456;test;456;C_73_00_a;0190;0050;UNDEFINED;n/a;0.25;22/03/2022
456;test;456;C_73_00_a;0200;0050;UNDEFINED;n/a;0.25;22/03/2022
456;test;456;C_73_00_a;0250;0050;UNDEFINED;n/a;0.2;22/03/2022
456;test;456;C_73_00_a;0280;0050;UNDEFINED;n/a;0.2;22/03/2022
456;test;456;C_73_00_a;0290;0050;UNDEFINED;n/a;0.1;22/03/2022
456;test;456;C_73_00_a;0360;0050;UNDEFINED;n/a;0.0;22/03/2022
456;test;456;C_73_00_a;0370;0050;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_73_00_a;0380;0050;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_73_00_a;0390;0050;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_73_00_a;0400;0050;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_73_00_a;0420;0050;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_73_00_a;0430;0050;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_73_00_a;0450;0050;UNDEFINED;n/a;0.5;22/03/2022
456;test;456;C_73_00_a;0035;0050;UNDEFINED;n/a;0.0;22/03/2022
456;test;456;C_73_00_a;0180;0050;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_73_00_a;0204;0050;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_73_00_a;0206;0050;UNDEFINED;n/a;0.2;22/03/2022
456;test;456;C_73_00_a;0207;0050;UNDEFINED;n/a;0.4;22/03/2022
456;test;456;C_73_00_a;0220;0050;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_73_00_a;0230;0050;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_73_00_a;0300;0050;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_73_00_a;0510;0050;UNDEFINED;n/a;0.05;22/03/2022
456;test;456;C_73_00_a;0520;0050;UNDEFINED;n/a;0.1;22/03/2022
456;test;456;C_73_00_a;0540;0050;UNDEFINED;n/a;0.4;22/03/2022
456;test;456;C_73_00_a;0560;0050;UNDEFINED;n/a;0.75;22/03/2022
456;test;456;C_73_00_a;0600;0050;UNDEFINED;n/a;0.3;22/03/2022
456;test;456;C_73_00_a;0610;0050;UNDEFINED;n/a;0.4;22/03/2022
456;test;456;C_73_00_a;0630;0050;UNDEFINED;n/a;0.1;22/03/2022
456;test;456;C_73_00_a;0640;0050;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_73_00_a;0660;0050;UNDEFINED;n/a;0.05;22/03/2022
456;test;456;C_73_00_a;0670;0050;UNDEFINED;n/a;0.3;22/03/2022
456;test;456;C_73_00_a;0680;0050;UNDEFINED;n/a;0.4;22/03/2022
456;test;456;C_73_00_a;0700;0050;UNDEFINED;n/a;0.75;22/03/2022
456;test;456;C_73_00_a;0710;0050;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_73_00_a;0890;0050;UNDEFINED;n/a;0.0;22/03/2022
456;test;456;C_73_00_a;0900;0050;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_73_00_a;0913;0050;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_73_00_a;0914;0050;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_73_00_a;0915;0050;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_73_00_a;0916;0050;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_73_00_a;0917;0050;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_73_00_a;0918;0050;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_73_00_a;0940;0050;UNDEFINED;n/a;0.0;22/03/2022
456;test;456;C_73_00_a;0950;0050;UNDEFINED;n/a;0.0;22/03/2022
456;test;456;C_73_00_a;0960;0050;UNDEFINED;n/a;0.0;22/03/2022
456;test;456;C_73_00_a;0970;0050;UNDEFINED;n/a;0.0;22/03/2022
456;test;456;C_73_00_a;0980;0050;UNDEFINED;n/a;0.0;22/03/2022
456;test;456;C_73_00_a;0990;0050;UNDEFINED;n/a;0.0;22/03/2022
456;test;456;C_73_00_a;1000;0050;UNDEFINED;n/a;0.0;22/03/2022
456;test;456;C_73_00_a;1010;0050;UNDEFINED;n/a;0.0;22/03/2022
456;test;456;C_73_00_a;1030;0050;UNDEFINED;n/a;0.0;22/03/2022
456;test;456;C_73_00_a;1040;0050;UNDEFINED;n/a;0.07;22/03/2022
456;test;456;C_73_00_a;1050;0050;UNDEFINED;n/a;0.15;22/03/2022
456;test;456;C_73_00_a;1060;0050;UNDEFINED;n/a;0.25;22/03/2022
456;test;456;C_73_00_a;1070;0050;UNDEFINED;n/a;0.3;22/03/2022
456;test;456;C_73_00_a;1080;0050;UNDEFINED;n/a;0.35;22/03/2022
456;test;456;C_73_00_a;1090;0050;UNDEFINED;n/a;0.5;22/03/2022
456;test;456;C_73_00_a;1100;0050;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_74_00_a;0040;0080;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_74_00_a;0060;0080;UNDEFINED;n/a;0.5;22/03/2022
456;test;456;C_74_00_a;0070;0080;UNDEFINED;n/a;0.5;22/03/2022
456;test;456;C_74_00_a;0090;0080;UNDEFINED;n/a;0.5;22/03/2022
456;test;456;C_74_00_a;0201;0080;UNDEFINED;n/a;0.2;22/03/2022
456;test;456;C_74_00_a;0260;0080;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_74_00_a;0080;0080;UNDEFINED;n/a;0.5;22/03/2022
456;test;456;C_74_00_a;0130;0080;UNDEFINED;n/a;0.05;22/03/2022
456;test;456;C_74_00_a;0150;0080;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_74_00_a;0170;0080;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_74_00_a;0190;0080;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_74_00_a;0180;0080;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_74_00_a;0230;0080;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_74_00_a;0160;0080;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_74_00_a;0210;0080;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_74_00_a;0269;0080;UNDEFINED;n/a;0.0;22/03/2022
456;test;456;C_74_00_a;0273;0080;UNDEFINED;n/a;0.07;22/03/2022
456;test;456;C_74_00_a;0277;0080;UNDEFINED;n/a;0.15;22/03/2022
456;test;456;C_74_00_a;0281;0080;UNDEFINED;n/a;0.25;22/03/2022
456;test;456;C_74_00_a;0285;0080;UNDEFINED;n/a;0.3;22/03/2022
456;test;456;C_74_00_a;0289;0080;UNDEFINED;n/a;0.35;22/03/2022
456;test;456;C_74_00_a;0293;0080;UNDEFINED;n/a;0.5;22/03/2022
456;test;456;C_74_00_a;0301;0080;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_74_00_a;0303;0080;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_74_00_a;0309;0080;UNDEFINED;n/a;0.0;22/03/2022
456;test;456;C_74_00_a;0313;0080;UNDEFINED;n/a;0.07;22/03/2022
456;test;456;C_74_00_a;0317;0080;UNDEFINED;n/a;0.15;22/03/2022
456;test;456;C_74_00_a;0321;0080;UNDEFINED;n/a;0.25;22/03/2022
456;test;456;C_74_00_a;0325;0080;UNDEFINED;n/a;0.3;22/03/2022
456;test;456;C_74_00_a;0329;0080;UNDEFINED;n/a;0.35;22/03/2022
456;test;456;C_74_00_a;0333;0080;UNDEFINED;n/a;0.5;22/03/2022
456;test;456;C_74_00_a;0341;0080;UNDEFINED;n/a;0.5;22/03/2022
456;test;456;C_74_00_a;0343;0080;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_74_00_a;0345;0080;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_72_00_a;0070;0010;UNDEFINED;n/a;5198630.14;22/03/2022
456;test;456;C_72_00_a;0190;0010;UNDEFINED;n/a;835892217.0;22/03/2022
456;test;456;C_72_00_a;0260;0010;UNDEFINED;n/a;4745984333.0;22/03/2022
456;test;456;C_73_00_a;0035;0010;UNDEFINED;n/a;25424822307.28;22/03/2022
456;test;456;C_73_00_a;0070;0010;UNDEFINED;n/a;-33216232069.67;22/03/2022
456;test;456;C_73_00_a;0080;0010;UNDEFINED;n/a;-20966122130.53;22/03/2022
456;test;456;C_73_00_a;0110;0010;UNDEFINED;n/a;-9384698955.8;22/03/2022
456;test;456;C_73_00_a;0230;0010;UNDEFINED;n/a;2193605666.84;22/03/2022
456;test;456;C_73_00_a;0250;0010;UNDEFINED;n/a;-573769151.28;22/03/2022
456;test;456;C_73_00_a;0260;0010;UNDEFINED;n/a;3333715453.55;22/03/2022
456;test;456;C_73_00_a;0918;0010;UNDEFINED;n/a;124366.0;22/03/2022
456;test;456;C_74_00_a;0160;0010;UNDEFINED;n/a;-54345799619.07;22/03/2022
456;test;456;C_74_00_a;0260;0010;UNDEFINED;n/a;150348.16;22/03/2022
456;test;456;C_73_00_a;1100;0010;UNDEFINED;n/a;-37633449687.15;22/03/2022
456;test;456;C_73_00_a;1100;0020;UNDEFINED;n/a;-3764349687.15;22/03/2022
456;test;456;C_73_00_a;1040;0040;UNDEFINED;n/a;33764349687.15;22/03/2022
456;test;456;C_73_00_a;1045;0040;UNDEFINED;n/a;33764349687.15;22/03/2022
456;test;456;C_73_00_a;1045;0030;UNDEFINED;n/a;335098209.05;22/03/2022
456;test;456;C_73_00_a;1040;0010;UNDEFINED;n/a;7449687.15;22/03/2022
456;test;456;C_73_00_a;1045;0010;UNDEFINED;n/a;76449687.15;22/03/2022
456;test;456;C_72_00_a;0050;0010;UNDEFINED;n/a;40409261.0100539;22/03/2022
456;test;456;C_74_00_a;0040;0010;UNDEFINED;n/a;46860662.1948734;22/03/2022
456;test;456;C_74_00_a;0060;0010;UNDEFINED;n/a;1783648.53838003;22/03/2022
456;test;456;C_74_00_a;0070;0010;UNDEFINED;n/a;7847645.76582712;22/03/2022
456;test;456;C_73_00_a;0310;0010;UNDEFINED;n/a;48100909.2077918;22/03/2022
456;test;456;C_74_00_a;0201;0010;UNDEFINED;n/a;45652287.0078367;22/03/2022
456;test;456;C_72_00_a;0590;0010;UNDEFINED;n/a;19988230.281333;22/03/2022
456;test;456;C_73_00_a;0480;0010;UNDEFINED;n/a;28243908.6235795;22/03/2022
456;test;456;C_73_00_a;0490;0010;UNDEFINED;n/a;12655653.8647408;22/03/2022
456;test;456;C_73_00_a;0530;0010;UNDEFINED;n/a;27792100.4510517;22/03/2022
456;test;456;C_73_00_a;0570;0010;UNDEFINED;n/a;20768476.5051213;22/03/2022
456;test;456;C_73_00_a;0480;0010;UNDEFINED;n/a;28601515.4535418;22/03/2022
456;test;456;C_73_00_a;0490;0010;UNDEFINED;n/a;17269663.9202129;22/03/2022
456;test;456;C_73_00_a;0530;0010;UNDEFINED;n/a;21250486.2477187;22/03/2022
456;test;456;C_73_00_a;0570;0010;UNDEFINED;n/a;12924566.8399212;22/03/2022
456;test;456;C_73_00_a;0110;0010;UNDEFINED;n/a;17299383.641137;22/03/2022
456;test;456;C_73_00_a;0035;0010;UNDEFINED;n/a;19054145.8837998;22/03/2022
456;test;456;C_72_00_a;0280;0010;UNDEFINED;n/a;294348.91379545;22/03/2022
456;test;456;C_73_00_a;0340;0010;UNDEFINED;n/a;40803729.9712868;22/03/2022
456;test;456;C_74_00_a;0240;0010;UNDEFINED;n/a;25387904.3875074;22/03/2022
456;test;456;C_73_00_a;0340;0010;UNDEFINED;n/a;6951075.43742419;22/03/2022
456;test;456;C_74_00_a;0240;0010;UNDEFINED;n/a;12298844.1430509;22/03/2022
456;test;456;C_72_00_a;0040;0030;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_72_00_a;0050;0030;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_72_00_a;0060;0030;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_72_00_a;0070;0030;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_72_00_a;0090;0030;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_72_00_a;0110;0030;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_72_00_a;0240;0030;UNDEFINED;n/a;0.85;22/03/2022
456;test;456;C_72_00_a;0260;0030;UNDEFINED;n/a;0.85;22/03/2022
456;test;456;C_72_00_a;0080;0030;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_72_00_a;0100;0030;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_72_00_a;0120;0030;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_72_00_a;0130;0030;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_72_00_a;0140;0030;UNDEFINED;n/a;0.95;22/03/2022
456;test;456;C_72_00_a;0150;0030;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_72_00_a;0170;0030;UNDEFINED;n/a;0.8;22/03/2022
456;test;456;C_72_00_a;0190;0030;UNDEFINED;n/a;0.93;22/03/2022
456;test;456;C_72_00_a;0200;0030;UNDEFINED;n/a;0.88;22/03/2022
456;test;456;C_72_00_a;0250;0030;UNDEFINED;n/a;0.85;22/03/2022
456;test;456;C_72_00_a;0270;0030;UNDEFINED;n/a;0.85;22/03/2022
456;test;456;C_72_00_a;0280;0030;UNDEFINED;n/a;0.85;22/03/2022
456;test;456;C_72_00_a;0290;0030;UNDEFINED;n/a;0.8;22/03/2022
456;test;456;C_72_00_a;0320;0030;UNDEFINED;n/a;0.75;22/03/2022
456;test;456;C_72_00_a;0330;0030;UNDEFINED;n/a;0.75;22/03/2022
456;test;456;C_72_00_a;0340;0030;UNDEFINED;n/a;0.7;22/03/2022
456;test;456;C_72_00_a;0350;0030;UNDEFINED;n/a;0.65;22/03/2022
456;test;456;C_72_00_a;0360;0030;UNDEFINED;n/a;0.5;22/03/2022
456;test;456;C_72_00_a;0370;0030;UNDEFINED;n/a;0.5;22/03/2022
456;test;456;C_72_00_a;0380;0030;UNDEFINED;n/a;0.5;22/03/2022
456;test;456;C_72_00_a;0390;0030;UNDEFINED;n/a;0.5;22/03/2022
456;test;456;C_72_00_a;0400;0030;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_72_00_a;0410;0030;UNDEFINED;n/a;0.7;22/03/2022
456;test;456;C_72_00_a;0420;0030;UNDEFINED;n/a;0.65;22/03/2022
456;test;456;C_72_00_a;0430;0030;UNDEFINED;n/a;0.6;22/03/2022
456;test;456;C_72_00_a;0440;0030;UNDEFINED;n/a;0.45;22/03/2022
456;test;456;C_72_00_a;0450;0030;UNDEFINED;n/a;0.75;22/03/2022
456;test;456;C_72_00_a;0460;0030;UNDEFINED;n/a;0.75;22/03/2022
456;test;456;C_73_00_a;0040;0050;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_73_00_a;0070;0050;UNDEFINED;n/a;0.15;22/03/2022
456;test;456;C_73_00_a;0090;0050;UNDEFINED;n/a;0.03;22/03/2022
456;test;456;C_73_00_a;0110;0050;UNDEFINED;n/a;0.1;22/03/2022
456;test;456;C_73_00_a;0260;0050;UNDEFINED;n/a;0.4;22/03/2022
456;test;456;C_73_00_a;0310;0050;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_73_00_a;0480;0050;UNDEFINED;n/a;0.05;22/03/2022
456;test;456;C_73_00_a;0490;0050;UNDEFINED;n/a;0.1;22/03/2022
456;test;456;C_73_00_a;0530;0050;UNDEFINED;n/a;0.4;22/03/2022
456;test;456;C_73_00_a;0570;0050;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_73_00_a;0590;0050;UNDEFINED;n/a;0.05;22/03/2022
456;test;456;C_73_00_a;0080;0050;UNDEFINED;n/a;0.05;22/03/2022
456;test;456;C_73_00_a;0140;0050;UNDEFINED;n/a;0.05;22/03/2022
456;test;456;C_73_00_a;0150;0050;UNDEFINED;n/a;0.25;22/03/2022
456;test;456;C_73_00_a;0170;0050;UNDEFINED;n/a;0.25;22/03/2022
456;test;456;C_73_00_a;0190;0050;UNDEFINED;n/a;0.25;22/03/2022
456;test;456;C_73_00_a;0200;0050;UNDEFINED;n/a;0.25;22/03/2022
456;test;456;C_73_00_a;0250;0050;UNDEFINED;n/a;0.2;22/03/2022
456;test;456;C_73_00_a;0280;0050;UNDEFINED;n/a;0.2;22/03/2022
456;test;456;C_73_00_a;0290;0050;UNDEFINED;n/a;0.1;22/03/2022
456;test;456;C_73_00_a;0360;0050;UNDEFINED;n/a;0.0;22/03/2022
456;test;456;C_73_00_a;0370;0050;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_73_00_a;0380;0050;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_73_00_a;0390;0050;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_73_00_a;0400;0050;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_73_00_a;0420;0050;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_73_00_a;0430;0050;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_73_00_a;0450;0050;UNDEFINED;n/a;0.5;22/03/2022
456;test;456;C_73_00_a;0035;0050;UNDEFINED;n/a;0.0;22/03/2022
456;test;456;C_73_00_a;0180;0050;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_73_00_a;0204;0050;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_73_00_a;0206;0050;UNDEFINED;n/a;0.2;22/03/2022
456;test;456;C_73_00_a;0207;0050;UNDEFINED;n/a;0.4;22/03/2022
456;test;456;C_73_00_a;0220;0050;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_73_00_a;0230;0050;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_73_00_a;0300;0050;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_73_00_a;0510;0050;UNDEFINED;n/a;0.05;22/03/2022
456;test;456;C_73_00_a;0520;0050;UNDEFINED;n/a;0.1;22/03/2022
456;test;456;C_73_00_a;0540;0050;UNDEFINED;n/a;0.4;22/03/2022
456;test;456;C_73_00_a;0560;0050;UNDEFINED;n/a;0.75;22/03/2022
456;test;456;C_73_00_a;0600;0050;UNDEFINED;n/a;0.3;22/03/2022
456;test;456;C_73_00_a;0610;0050;UNDEFINED;n/a;0.4;22/03/2022
456;test;456;C_73_00_a;0630;0050;UNDEFINED;n/a;0.1;22/03/2022
456;test;456;C_73_00_a;0640;0050;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_73_00_a;0660;0050;UNDEFINED;n/a;0.05;22/03/2022
456;test;456;C_73_00_a;0670;0050;UNDEFINED;n/a;0.3;22/03/2022
456;test;456;C_73_00_a;0680;0050;UNDEFINED;n/a;0.4;22/03/2022
456;test;456;C_73_00_a;0700;0050;UNDEFINED;n/a;0.75;22/03/2022
456;test;456;C_73_00_a;0710;0050;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_73_00_a;0890;0050;UNDEFINED;n/a;0.0;22/03/2022
456;test;456;C_73_00_a;0900;0050;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_73_00_a;0913;0050;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_73_00_a;0914;0050;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_73_00_a;0915;0050;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_73_00_a;0916;0050;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_73_00_a;0917;0050;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_73_00_a;0918;0050;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_73_00_a;0940;0050;UNDEFINED;n/a;0.0;22/03/2022
456;test;456;C_73_00_a;0950;0050;UNDEFINED;n/a;0.0;22/03/2022
456;test;456;C_73_00_a;0960;0050;UNDEFINED;n/a;0.0;22/03/2022
456;test;456;C_73_00_a;0970;0050;UNDEFINED;n/a;0.0;22/03/2022
456;test;456;C_73_00_a;0980;0050;UNDEFINED;n/a;0.0;22/03/2022
456;test;456;C_73_00_a;0990;0050;UNDEFINED;n/a;0.0;22/03/2022
456;test;456;C_73_00_a;1000;0050;UNDEFINED;n/a;0.0;22/03/2022
456;test;456;C_73_00_a;1010;0050;UNDEFINED;n/a;0.0;22/03/2022
456;test;456;C_73_00_a;1030;0050;UNDEFINED;n/a;0.0;22/03/2022
456;test;456;C_73_00_a;1040;0050;UNDEFINED;n/a;0.07;22/03/2022
456;test;456;C_73_00_a;1050;0050;UNDEFINED;n/a;0.15;22/03/2022
456;test;456;C_73_00_a;1060;0050;UNDEFINED;n/a;0.25;22/03/2022
456;test;456;C_73_00_a;1070;0050;UNDEFINED;n/a;0.3;22/03/2022
456;test;456;C_73_00_a;1080;0050;UNDEFINED;n/a;0.35;22/03/2022
456;test;456;C_73_00_a;1090;0050;UNDEFINED;n/a;0.5;22/03/2022
456;test;456;C_73_00_a;1100;0050;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_74_00_a;0040;0080;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_74_00_a;0060;0080;UNDEFINED;n/a;0.5;22/03/2022
456;test;456;C_74_00_a;0070;0080;UNDEFINED;n/a;0.5;22/03/2022
456;test;456;C_74_00_a;0090;0080;UNDEFINED;n/a;0.5;22/03/2022
456;test;456;C_74_00_a;0201;0080;UNDEFINED;n/a;0.2;22/03/2022
456;test;456;C_74_00_a;0260;0080;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_74_00_a;0080;0080;UNDEFINED;n/a;0.5;22/03/2022
456;test;456;C_74_00_a;0130;0080;UNDEFINED;n/a;0.05;22/03/2022
456;test;456;C_74_00_a;0150;0080;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_74_00_a;0170;0080;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_74_00_a;0190;0080;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_74_00_a;0180;0080;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_74_00_a;0230;0080;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_74_00_a;0160;0080;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_74_00_a;0210;0080;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_74_00_a;0269;0080;UNDEFINED;n/a;0.0;22/03/2022
456;test;456;C_74_00_a;0273;0080;UNDEFINED;n/a;0.07;22/03/2022
456;test;456;C_74_00_a;0277;0080;UNDEFINED;n/a;0.15;22/03/2022
456;test;456;C_74_00_a;0281;0080;UNDEFINED;n/a;0.25;22/03/2022
456;test;456;C_74_00_a;0285;0080;UNDEFINED;n/a;0.3;22/03/2022
456;test;456;C_74_00_a;0289;0080;UNDEFINED;n/a;0.35;22/03/2022
456;test;456;C_74_00_a;0293;0080;UNDEFINED;n/a;0.5;22/03/2022
456;test;456;C_74_00_a;0301;0080;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_74_00_a;0303;0080;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_74_00_a;0309;0080;UNDEFINED;n/a;0.0;22/03/2022
456;test;456;C_74_00_a;0313;0080;UNDEFINED;n/a;0.07;22/03/2022
456;test;456;C_74_00_a;0317;0080;UNDEFINED;n/a;0.15;22/03/2022
456;test;456;C_74_00_a;0321;0080;UNDEFINED;n/a;0.25;22/03/2022
456;test;456;C_74_00_a;0325;0080;UNDEFINED;n/a;0.3;22/03/2022
456;test;456;C_74_00_a;0329;0080;UNDEFINED;n/a;0.35;22/03/2022
456;test;456;C_74_00_a;0333;0080;UNDEFINED;n/a;0.5;22/03/2022
456;test;456;C_74_00_a;0341;0080;UNDEFINED;n/a;0.5;22/03/2022
456;test;456;C_74_00_a;0343;0080;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_74_00_a;0345;0080;UNDEFINED;n/a;1.0;22/03/2022
456;test;456;C_72_00_a;0070;0010;UNDEFINED;n/a;5198630.14;22/03/2022
456;test;456;C_72_00_a;0190;0010;UNDEFINED;n/a;835892217.0;22/03/2022
456;test;456;C_72_00_a;0260;0010;UNDEFINED;n/a;4745984333.0;22/03/2022
456;test;456;C_73_00_a;0035;0010;UNDEFINED;n/a;25424822307.28;22/03/2022
456;test;456;C_73_00_a;0070;0010;UNDEFINED;n/a;-33216232069.67;22/03/2022
456;test;456;C_73_00_a;0080;0010;UNDEFINED;n/a;-20966122130.53;22/03/2022
456;test;456;C_73_00_a;0110;0010;UNDEFINED;n/a;-9384698955.8;22/03/2022
456;test;456;C_73_00_a;0230;0010;UNDEFINED;n/a;2193605666.84;22/03/2022
456;test;456;C_73_00_a;0250;0010;UNDEFINED;n/a;-573769151.28;22/03/2022
456;test;456;C_73_00_a;0260;0010;UNDEFINED;n/a;3333715453.55;22/03/2022
456;test;456;C_73_00_a;0918;0010;UNDEFINED;n/a;124366.0;22/03/2022
456;test;456;C_74_00_a;0160;0010;UNDEFINED;n/a;-54345799619.07;22/03/2022
456;test;456;C_74_00_a;0260;0010;UNDEFINED;n/a;150348.16;22/03/2022
456;test;456;C_73_00_a;1100;0010;UNDEFINED;n/a;-37633449687.15;22/03/2022
456;test;456;C_73_00_a;1100;0020;UNDEFINED;n/a;-3764349687.15;22/03/2022
456;test;456;C_73_00_a;1040;0040;UNDEFINED;n/a;33764349687.15;22/03/2022
456;test;456;C_73_00_a;1045;0040;UNDEFINED;n/a;33764349687.15;22/03/2022
456;test;456;C_73_00_a;1045;0030;UNDEFINED;n/a;335098209.05;22/03/2022
456;test;456;C_73_00_a;1040;0010;UNDEFINED;n/a;7449687.15;22/03/2022
456;test;456;C_73_00_a;1045;0010;UNDEFINED;n/a;76449687.15;22/03/2022
I hope you can lead me in the right direction.

Because some rows must be excluded from the sum, first filter out the rows that match the condition, sum the remaining rows and drop their duplicates, and then add the excluded rows back:
m = df['column_item'].isin([30, 50])
df1 = df[~m].copy()
df1['amount'] = df1.groupby(['report_name', 'line_item', 'column_item'])['amount'].transform('sum')
df1 = df1.drop_duplicates(['report_name', 'line_item', 'column_item'])
df = pd.concat([df1, df[m]])
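A self-contained sketch of the same idea on a small illustrative sample, assuming the file is read with sep=";" so that column_item values such as 0010 are parsed as integers. The sample rows, the variable names (excluded, summed, result) and the choice to keep the first value of the non-key columns are assumptions for illustration:

```python
import io
import pandas as pd

# Small illustrative sample in the same semicolon-separated layout
csv_text = """entity;report_name;line_item;column_item;amount
456;C_73_00_a;1100;0010;-37633449687.15
456;C_73_00_a;1100;0010;-37633449687.15
456;C_73_00_a;1100;0050;1.0
456;C_73_00_a;1100;0050;1.0
456;C_72_00_a;0070;0010;5198630.14
"""
df = pd.read_csv(io.StringIO(csv_text), sep=";")  # "0010" is parsed as the integer 10

keys = ["report_name", "line_item", "column_item"]
excluded = df["column_item"].isin([30, 50])  # rows that must not be summed

# Sum amount per key; keep the first value of every other non-key column
agg = {col: "first" for col in df.columns if col not in keys}
agg["amount"] = "sum"
summed = df[~excluded].groupby(keys, as_index=False, sort=False).agg(agg)

# Put the untouched rows back and restore the original column order
result = pd.concat([summed, df[excluded]], ignore_index=True)[df.columns]
print(result)
```

Here the two rows with column_item 10 and line_item 1100 collapse into one, while both rows with column_item 50 are kept unchanged.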

If you need to get just duplicated rows and sum over them, you can do something like:
(df[(df[["report_name", "line_item","column_item"]].duplicated(keep=False)) & (~df['column_item'].isin([30, 50]))]
.groupby(["report_name", "line_item","column_item"])["amount"]
.sum())
This will result in something like:
report_name  line_item  column_item
C_72_00_a    50         10             4.040926e+07
             70         10             5.198630e+06
             190        10             8.358922e+08
             260        10             4.745984e+09
             280        10             2.943489e+05
                                           ...
C_74_00_a    329        80             3.500000e-01
             333        80             5.000000e-01
             341        80             5.000000e-01
             343        80             1.000000e+00
             345        80             1.000000e+00
Name: amount, Length: 67, dtype: float64
To make sure that you are getting the correct values, let's check the example shown in your question (the one with C_73_00_a, 1100 and 10):
dfResult = (df[(df[["report_name", "line_item","column_item"]].duplicated(keep=False)) & (~df['column_item'].isin([30, 50]))]
.groupby(["report_name", "line_item","column_item"])["amount"]
.sum())
dfResult[('C_73_00_a', 1100, 10)]
This will output:
-75266899374.3
Which is the result of -37633449687.15 + -37633449687.15 (as shown in your question).

Is there a faster way to split a pandas dataframe into two complementary parts?

Good evening all,
I have a situation where I need to split a dataframe into two complementary parts based on the value of one feature.
What I mean by this is that for every row in dataframe 1, I need a complementary row in dataframe 2 that takes on the opposite value of that specific feature.
In my source dataframe, the feature I'm referring to is stored under column "773", and it can take on values of either 0.0 or 1.0.
I came up with the following code that does this sufficiently, but it is remarkably slow. It takes about a minute to split 10,000 rows, even on my all-powerful EC2 instance.
data = chunk.iloc[:,1:776]
listy1 = []
listy2 = []
for i in range(0,len(data)):
    random_row = data.sample(n=1).iloc[0]
    listy1.append(random_row.tolist())
    if random_row["773"] == 0.0:
        x = data[data["773"] == 1.0].sample(n=1).iloc[0]
        listy2.append(x.tolist())
    else:
        x = data[data["773"] == 0.0].sample(n=1).iloc[0]
        listy2.append(x.tolist())
df1 = pd.DataFrame(listy1)
df2 = pd.DataFrame(listy2)
Note: I don't care about duplicate rows, because this data is being used to train a model that compares two objects to tell which one is "better."
Do you have some insight into why this is so slow, or any suggestions as to make this faster?

A key concept in efficient numpy/scipy/pandas coding is using the library's vectorized functions whenever possible. Try to process multiple rows at once instead of iterating explicitly over rows, i.e. avoid for loops and .iterrows(). In the original loop, every iteration filters the whole DataFrame and calls sample for a single row, so the cost grows roughly quadratically with the number of rows.
The implementation below is a little subtle in terms of indexing, but the vectorized approach is straightforward:
1. Draw the main dataset at once.
2. Build the complementary dataset: draw the 0-rows at once, draw the complementary 1-rows at once, and then put them into the corresponding rows at once.
Code:
import pandas as pd
import numpy as np
from datetime import datetime
np.random.seed(52) # reproducibility
n = 10000
df = pd.DataFrame(
    data={
        "773": [0, 1] * int(n / 2),
        "dummy1": list(range(n)),
        "dummy2": list(range(0, 10 * n, 10)),
    }
)
t0 = datetime.now()
print("Program begins...")
# 1. draw the main dataset
draw_idx = np.random.choice(n, n) # repeatable draw
df_main = df.iloc[draw_idx, :].reset_index(drop=True)
# 2. draw the complementary dataset
# (1) count number of 1's and 0's
n_1 = np.count_nonzero(df["773"][draw_idx].values)
n_0 = n - n_1
# (2) split data for drawing
df_0 = df[df["773"] == 0].reset_index(drop=True)
df_1 = df[df["773"] == 1].reset_index(drop=True)
# (3) draw n_1 indexes in df_0 and n_0 indexes in df_1
idx_0 = np.random.choice(len(df_0), n_1)
idx_1 = np.random.choice(len(df_1), n_0)
# (4) broadcast the drawn rows into the complementary dataset
df_comp = df_main.copy()
mask_0 = (df_main["773"] == 0).values
df_comp.iloc[mask_0 ,:] = df_1.iloc[idx_1, :].values # df_1 into mask_0
df_comp.iloc[~mask_0 ,:] = df_0.iloc[idx_0, :].values # df_0 into ~mask_0
print(f"Program ends in {(datetime.now() - t0).total_seconds():.3f}s...")
Check
print(df_main.head(5))
773 dummy1 dummy2
0 0 28 280
1 1 11 110
2 1 13 130
3 1 23 230
4 0 86 860
print(df_comp.head(5))
773 dummy1 dummy2
0 1 19 190
1 0 74 740
2 0 28 280 <- this row is complementary to df_main
3 0 60 600
4 1 37 370
Efficiency gain: 14.23s -> 0.011s (ca. 1,290x)
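The same vectorized idea can be sketched more compactly by drawing one candidate partner from each class for every row and then picking the right one with np.where. The function name draw_complementary_pairs and its parameters are hypothetical, and the sketch assumes the feature column only contains 0 and 1 and that both classes are present:

```python
import numpy as np
import pandas as pd

def draw_complementary_pairs(df, col, n_pairs, seed=None):
    """Return (df_main, df_comp) where row i of df_comp has the opposite value in col."""
    rng = np.random.default_rng(seed)
    flags = df[col].to_numpy()
    pos_0 = np.flatnonzero(flags == 0)  # positions of the 0-rows
    pos_1 = np.flatnonzero(flags == 1)  # positions of the 1-rows

    # Main draw: random row positions, with replacement
    main_pos = rng.integers(0, len(df), size=n_pairs)
    main_is_0 = flags[main_pos] == 0

    # Draw a candidate partner from each class for every row, then keep
    # the one from the opposite class
    partner_from_1 = rng.choice(pos_1, size=n_pairs)
    partner_from_0 = rng.choice(pos_0, size=n_pairs)
    comp_pos = np.where(main_is_0, partner_from_1, partner_from_0)

    df_main = df.iloc[main_pos].reset_index(drop=True)
    df_comp = df.iloc[comp_pos].reset_index(drop=True)
    return df_main, df_comp

# Usage on a small synthetic frame
demo = pd.DataFrame({"773": [0.0, 1.0] * 50, "dummy1": range(100)})
demo_main, demo_comp = draw_complementary_pairs(demo, "773", 1000, seed=0)
print((demo_main["773"] + demo_comp["773"]).eq(1).all())
```

Drawing twice as many candidates as needed wastes a little work, but it avoids the masked assignment step and keeps the indexing simple.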

Adding values from a CSV file

I am beginning to learn Python and am struggling with the syntax.
I have a simple CSV file that looks like this:
0.01,10,20,0.35,40,50,60,70,80,90,100
2,22,32,42,52,62,72,82,92,102,112
3,33,43,53,63,5647,83,93,103,113,123
I want to look for the highest and lowest value in all the data in the csv file except in the first value of each row.
So effectively the answer here would be
highestValue=5647
lowestValue=0.35
because the data that is looked at is as follows (it ignored the first value of each row)
10,20,0.35,40,50,60,70,80,90,100
22,32,42,52,62,72,82,92,102,112
33,43,53,63,5647,83,93,103,113,123
I would like my code to work for ANY row length.
I really have to admit I'm struggling but here's what I've tried. I usually program PHP so this is all new to me. I have been working on this simple task for a day and can't fathom it out. I think I'm getting confused with terminology 'lists' for example.
import numpy
test_data_file = open ("Anaconda3JamesData/james_test_3.csv","r")
test_data_list = test_data_file.readlines()
test_data_file.close()
for record in test_data_list:
    all_values = record.split(',')
    maxvalue = numpy.max(numpy.asfarray(all_values[1:]))
print (maxvalue)
With the test data (the CSV file shown at the very top of this question) I would expect the answer to be
highestValue=5647
lowestValue=0.35

If you're using numpy, you can read your CSV file as a numpy.ndarray using numpy.genfromtxt() and then use the array's .max() and .min() methods:
import numpy
array = numpy.genfromtxt('Anaconda3JamesData/james_test_3.csv', delimiter=',')
array[:, 1:].max()
array[:, 1:].min()
The [:, 1:] part uses numpy's array indexing. It says: take all the rows (the [:, part), and for each row take all but the first column (the 1:] part). This doesn't work with Python's built-in lists.
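To see the slicing in isolation, here is a small sketch that reads the sample data from the question out of an in-memory string instead of a file (the io.StringIO wrapper and the variable names are assumptions made for illustration):

```python
import io
import numpy as np

# In-memory copy of the sample file from the question, so no file path is needed.
csv_text = """0.01,10,20,0.35,40,50,60,70,80,90,100
2,22,32,42,52,62,72,82,92,102,112
3,33,43,53,63,5647,83,93,103,113,123
"""

array = np.genfromtxt(io.StringIO(csv_text), delimiter=",")
trimmed = array[:, 1:]  # all rows, every column except the first

highest_value = trimmed.max()
lowest_value = trimmed.min()
print(array.shape, trimmed.shape)
print(highest_value, lowest_value)
```

Note that this approach expects every row to have the same number of columns, since the result is a rectangular array.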

You're overwriting maxvalue each time through the loop, so you're just getting the max value from the last line, not the whole file. You need to compare with the previous maximum.
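import numpy
maxvalue = None
for record in test_data_list:
    all_values = record.strip().split(',')
    # numpy.asfarray was removed in NumPy 2.0; asarray with dtype=float does the same here
    row_max = numpy.max(numpy.asarray(all_values[1:], dtype=float))
    if maxvalue is None:
        maxvalue = row_max
    else:
        maxvalue = max(maxvalue, row_max)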
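The same running comparison works for the lowest value too, and it does not depend on the rows having equal length. A minimal sketch without numpy follows; the function name min_max_skipping_first and the shortened sample rows are made up for illustration:

```python
def min_max_skipping_first(lines):
    """Return (lowest, highest) over all values except the first of each line."""
    lowest = None
    highest = None
    for record in lines:
        fields = record.strip().split(",")
        # fields[1:] skips the first value; it is empty for blank lines
        for text in fields[1:]:
            value = float(text)
            if lowest is None or value < lowest:
                lowest = value
            if highest is None or value > highest:
                highest = value
    return lowest, highest


# Rows from the question, cut to different lengths to show that row length does not matter.
sample_lines = [
    "0.01,10,20,0.35,40,50,60,70,80,90,100\n",
    "2,22,32,42\n",
    "3,33,43,53,63,5647,83\n",
]
lowest_value, highest_value = min_max_skipping_first(sample_lines)
print(lowest_value, highest_value)
```

Because an open file yields its lines one at a time, the file object from the question could be passed in directly instead of sample_lines.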

You do not need the power of numpy for this problem. A simple CSV reader is good enough:
import csv
with open("Anaconda3JamesData/james_test_3.csv") as infile:
    r = csv.reader(infile)
    rows = [list(map(float, line))[1:] for line in r]
max(map(max, rows))
# 5647.0
min(map(min, rows))
# 0.35

I think using numpy is unneeded for this task. First of all, this:
test_data_file = open ("Anaconda3JamesData/james_test_3.csv","r")
test_data_list = test_data_file.readlines()
test_data_file.close()
for record in test_data_list:
can be simplified into this:
with open("Anaconda3JamesData/james_test_3.csv","r") as test_data_file:
    for record in test_data_file:
We can use a list comprehension to read in all of the values:
with open("Anaconda3JamesData/james_test_3.csv","r") as test_data_file:
    values = [float(val) for line in test_data_file for val in line.split(",")[1:]]
values now contains all relevant numbers, so we can just do:
highest_value = max(values)
lowest_value = min(values)

Here's a pandas solution that can give the desired results:
import pandas as pd
df = pd.read_csv('test1.csv', header=None)
# df:
# 0 1 2 3 4 5 6 7 8 9 10
# 0 0.01 10 20 0.35 40 50 60 70 80 90 100
# 1 2.00 22 32 42.00 52 62 72 82 92 102 112
# 2 3.00 33 43 53.00 63 5647 83 93 103 113 123
df = df.iloc[:, 1:]
print("Highest value: {}".format(df.values.max()))
print("Lowest value: {}".format(df.values.min()))
#Output:
Highest value: 5647.0
Lowest value: 0.35

Extracting parts of array elements using python

I am trying to extract all integer values from a specific column (left, top, width and height) in a CSV file with multiple rows and columns. I have used pandas to isolate the columns I am interested in, but I'm stuck on how to use specific parts of an array.
Let me explain: I need to use the CSV file's column with the "left", "top", "width" and "height" attributes to obtain xmin, ymin, xmax and ymax (these are the coordinates of boxes in images). An example of a row in this column looks like this:
[{"left":171,"top":0,"width":163,"height":137,"label":"styrofoam container"},{"left":222,"top":42,"width":45,"height":70,"label":"chopstick"}]
I need to extract the 171, 0, 163 and 137 to do the necessary operations for finding my xmax, xmin, ymax and ymin.
The above line is a single row in my pandas array. How do I extract the numbers I need for running my operations?
Here is the code I have so far to extract the column:
import os
import csv
import pandas
import numpy as np
csvPath = "/path/of/my/csvfile/csvfile.csv"
data = pandas.read_csv(csvPath)
csv_coords = data['Answer.annotation_data'].values #column with the coordinates
image_name = data ['Input.image_url'].values
print(csv_coords[2])

Use:
import ast
import numpy as np
import pandas as pd
d = {'Answer.annotation_data': ['[{"left":171,"top":0,"width":163,"height":137,"label":"styrofoam container"},{"left":222,"top":42,"width":45,"height":70,"label":"chopstick"}]',
                                '[{"left":170,"top":10,"width":173,"height":157,"label":"styrofoam container"},{"left":222,"top":42,"width":45,"height":70,"label":"chopstick"}]']}
df = pd.DataFrame(d)
print (df)
Answer.annotation_data
0 [{"left":171,"top":0,"width":163,"height":137,...
1 [{"left":170,"top":10,"width":173,"height":157...
#convert string data to list of dicts if necessary
df['Answer.annotation_data'] = df['Answer.annotation_data'].apply(ast.literal_eval)
For each key in cols, extract the values from the dicts and return a DataFrame; finally, join the results together with concat:
def get_val(val):
    comb = [[y.get(val, np.nan) for y in x] for x in df['Answer.annotation_data']]
    return pd.DataFrame(comb).add_prefix('{}_'.format(val))
cols = ['left','top','width','height']
df1 = pd.concat([get_val(x) for x in cols], axis=1)
print (df1)
left_0 left_1 top_0 top_1 width_0 width_1 height_0 height_1
0 171 222 0 42 163 45 137 70
1 170 222 10 42 173 45 157 70
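The question ultimately asks for xmin, ymin, xmax and ymax. A rough sketch of that last step is below. It assumes the usual convention that left/top is the upper-left corner, so that xmax = left + width and ymax = top + height; that convention is an assumption, not something stated in the question. It also parses the cell with json.loads, since the sample cell is valid JSON.

```python
import json
import pandas as pd

# One sample cell copied from the question.
raw = '[{"left":171,"top":0,"width":163,"height":137,"label":"styrofoam container"},{"left":222,"top":42,"width":45,"height":70,"label":"chopstick"}]'
df = pd.DataFrame({"Answer.annotation_data": [raw]})

records = []
for row_id, cell in df["Answer.annotation_data"].items():
    for box in json.loads(cell):  # each cell holds a list of box dicts
        records.append({
            "row": row_id,
            "label": box["label"],
            "xmin": box["left"],
            "ymin": box["top"],
            "xmax": box["left"] + box["width"],   # assumed convention
            "ymax": box["top"] + box["height"],   # assumed convention
        })

# One output row per box, with the source row id kept for reference.
boxes = pd.DataFrame(records)
print(boxes)
```

This gives one row per box rather than one wide row per image, which tends to be easier to work with when the number of boxes per image varies.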

To access one field in your DataFrame, use
`data.loc[row][column]` or `data.loc[row, column]`
e.g.
`data.loc[0]['left']`
To find, e.g., the minimum of the top values globally:
min(data['top'])
Read and return values from array with conditions

I am trying to read from an array I created and return a value inside the array along with the column and row it is found in. This is what I have at the moment:
import pandas as pd
import os
import re
Dir = os.getcwd()
Blks = []
for files in Dir:
    for f in os.listdir(Dir):
        if re.search('txt', f):
            Blks = [each for each in os.listdir(Dir) if each.endswith('.txt')]
print (Blks)
for z in Blks:
    df = pd.read_csv(z, sep=r'\s+', names=['x','y','z'])
    a = []
    a = df.pivot(index='y', columns='x', values='z')
    print (a)
OUTPUTS:
x 300.00 300.25 300.50 300.75 301.00 301.25 301.50 301.75
y
200.00 100 100 100 100 100 100 100 100
200.25 100 100 100 100 110 100 100 100
200.50 100 100 100 100 100 100 100 100
x will be my columns and y the rows; the values inside the array correspond to their column and row. As you can see above, there is an odd value of 110 that is 10 above the other values. I'm trying to read the array and return the x (column) and y (row) of the value that differs by 10, by comparing it with the values next to it (top, bottom, right, left).
I hope someone can kindly guide me in the right direction; any beginner tips are appreciated. If it is unclear what I'm asking, please ask. I don't have years of experience and have only recently started with Python.

You could use DataFrame.iloc to loop through all the values, row by row and column by column (DataFrame.ix has been removed from pandas). Here df is the pivoted table (a in your code).
oddCoordinates=[]
for r in range(df.shape[0]):
    for c in range(df.shape[1]):
        if checkDiffFromNeighbors(df,r,c):
            oddCoordinates.append((r,c))
The row and column positions of values that differ from their neighbors are listed in oddCoordinates.
To check the difference between the neighbors, you could loop over them and count how many different values there are (define this function before running the loop above):
def checkDiffFromNeighbors(df,r,c):
    #counter of how many neighbors differ
    diffCnt = 0
    #loop over the neighbor rows
    for dr in [-1,0,1]:
        r1 = r+dr
        #row should be a valid number
        if r1>=0 and r1<df.shape[0]:
            #loop over columns in row
            for dc in [-1,0,1]:
                #either row or column delta should be 0, because we do not allow diagonal
                if dr==0 or dc==0:
                    c1 = c+dc
                    #check legal column
                    if c1>=0 and c1<df.shape[1]:
                        if df.iloc[r,c]!=df.iloc[r1,c1]:
                            diffCnt += 1
    # if diffCnt==1 then a neighbor is (probably) the odd one
    # otherwise this one is the odd one
    # Note that you could be more strict and require all neighbors to be different
    return diffCnt>1
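For larger grids, the same neighbor comparison can also be done without Python loops. The following is only a sketch of that idea, not the method from the answer above: it uses a small hand-made grid shaped like the pivoted output in the question, and it assumes that "odd" means differing from every existing top/bottom/left/right neighbor by at least threshold (both the grid and the threshold name are illustrative assumptions).

```python
import numpy as np
import pandas as pd

# Small grid shaped like the pivoted output in the question (values made up to match it).
a = pd.DataFrame(
    [[100, 100, 100, 100, 100],
     [100, 100, 100, 110, 100],
     [100, 100, 100, 100, 100]],
    index=pd.Index([200.00, 200.25, 200.50], name="y"),
    columns=pd.Index([300.00, 300.25, 300.50, 300.75, 301.00], name="x"),
)

threshold = 10  # assumed meaning of "10 difference"
grid = a.to_numpy(dtype=float)
padded = np.pad(grid, 1, constant_values=np.nan)  # NaN border marks "no neighbor"

# Stack the four neighbors of every cell: top, bottom, left, right.
neighbors = np.stack([
    padded[:-2, 1:-1],
    padded[2:, 1:-1],
    padded[1:-1, :-2],
    padded[1:-1, 2:],
])
diff = np.abs(neighbors - grid)

# A cell is odd if every neighbor that exists differs by at least the threshold.
is_odd = np.all(np.isnan(diff) | (diff >= threshold), axis=0)

# Translate positions back into (y, x) labels.
odd_cells = [(float(a.index[r]), float(a.columns[c])) for r, c in zip(*np.nonzero(is_odd))]
print(odd_cells)
```

For this grid the only flagged cell is the 110 at y = 200.25, x = 300.75. Requiring all neighbors to differ is the stricter variant mentioned in the comments of checkDiffFromNeighbors.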
