1 0 0 0.579322
2 0 0 0.579306
3 0 0 0.279274
4 5 0 0.579224
5 3 0 0.579157
3 0 0 0.47907
7 0 1 0.378963
8 9 0 0.578833
I'm a beginner in Python and struggling with this. I have four columns as shown above, and I need to save columns 1, 2, and 3 for the rows whose value in column 4 is greater than 0.4 and less than 0.5. Can this be done with numpy?
This is the code I tried.
import csv
csv_out = csv.writer(open('data_new.csv', 'w'), delimiter=',')
f = open('coordiantes.txt', "w+")
for line in f:
    vals = line.split('\t')
    for vals ([3]>=0.4 & vals[3]<=0.5):
        print vals[0], vals[1], vals[2]
        csv_out.writerow(vals[0], vals[1], vals[2], vals[3])
f.close()
This can be done with a few built-in numpy functions:
import numpy as np

vals = ...  # your array
# Boolean-index your array where the fourth column meets your criteria
vals = vals[np.where((vals[:,3] <= 0.5) & (vals[:,3] > 0.4))]
# use numpy to slice off the last column and save the file
np.savetxt('coordiantes.txt', vals[:,:3], delimiter=',')
You can do the following:
import numpy as np
data = np.loadtxt('coordinates.txt')
idx = np.where((data[:,3] <= 0.5) & (data[:,3] > 0.4))[0] # save where col 4's data is in (0.4,0.5]
selected_data = data[idx,:3] # get the 1st three cols for the rows of interest
np.savetxt('data_new.csv', selected_data, delimiter=',')
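As a side note (not from either answer above), the np.where call isn't strictly required; a boolean mask can index the rows directly. A minimal equivalent sketch:
import numpy as np

data = np.loadtxt('coordinates.txt')
mask = (data[:, 3] > 0.4) & (data[:, 3] <= 0.5)   # rows whose column-4 value is in (0.4, 0.5]
np.savetxt('data_new.csv', data[mask, :3], delimiter=',')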
I have scraped a webpage table, and the table items are in a sequential 1D list, with repeated headers. I want to reconstitute the table into a DataFrame.
I have an algorithm to do this, but I'd like to know whether there is a more pythonic/efficient way to achieve it. NB: I don't necessarily know how many columns there are in my table. Here's an example:
import pandas as pd

input = ['A',1,'B',5,'C',9,
         'A',2,'B',6,'C',10,
         'A',3,'B',7,'C',11,
         'A',4,'B',8,'C',12]

output = {}
it = iter(input)
val = next(it)
while val:
    if val in output:
        output[val].append(next(it))
    else:
        output[val] = [next(it)]
    val = next(it, None)

df = pd.DataFrame(output)
print(df)
with the result:
A B C
0 1 5 9
1 2 6 10
2 3 7 11
3 4 8 12
If your data is always "well behaved", then something like this should suffice:
import pandas as pd

data = ['A',1,'B',5,'C',9,
        'A',2,'B',6,'C',10,
        'A',3,'B',7,'C',11,
        'A',4,'B',8,'C',12]
result = {}
for k, v in zip(data[::2], data[1::2]):
    result.setdefault(k, []).append(v)
df = pd.DataFrame(result)
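Just as an aside (not part of the answer above), the same grouping can be written with collections.defaultdict; a minimal sketch using the same data list:
from collections import defaultdict
import pandas as pd

result = defaultdict(list)
for k, v in zip(data[::2], data[1::2]):   # pair each header with the value that follows it
    result[k].append(v)
df = pd.DataFrame(result)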
You can also use numpy reshape:
import numpy as np
import pandas as pd

l = input  # the flat list from the question
cols = sorted(set(l[::2]))
df = pd.DataFrame(np.reshape(l, (int(len(l)/len(cols)/2), len(cols)*2)).T[1::2].T, columns=cols)
A B C
0 1 5 9
1 2 6 10
2 3 7 11
3 4 8 12
Explanation:
# get the column names (every second element of the flat list)
cols = sorted(set(l[::2]))
# reshape the flat list into one row per record (name, value, name, value, ...)
shape = (int(len(l)/len(cols)/2), len(cols)*2)
np.reshape(l, shape)
# keep only the values: transpose, take every second row, and transpose back
.T[1::2].T
I have a Python list and want to write it out in a specific format.
input:
trend_end= ['skill1',10,0,13,'skill2',6,1,0,'skill3',5,8,9,'skill4',9,0,1]
I want to write a file like this:
output:
1 2 3
1 10 0 13
2 6 1 0
3 5 8 9
4 9 0 1
Basically, I need to do the following steps:
Separate the elements of the list for each skill.
Write them in a table shape and add column and row indices.
I want to use the result as input to another program, which is why I want to write it to a file.
I tried the following, but I know it is wrong. Can you see how I can fix it?
f1 = open("data.txt", "a")
for j in trend_end:
    f1.write(str(j))
for i in range(1, int(len(trend_end)/df1ana.shape[0])):
    G = [trend_end[i*(df1ana.shape[0]-10) - (df1ana.shape[0]-10):i*(df1ana.shape[0]-10)]]
    for h in G:
        f1.write(i)
        f1.write(h)
        f1.write('\n')
f.close()
df1ana.shape[0] is 3 in the above example; it is basically the length of the data for each skill.
Another option that you can try via pandas:
import pandas as pd
pd.DataFrame([trend_end[i+1:i+4] for i in range(0,len(trend_end),4)]).to_csv('data.txt', sep='\t')
OUTPUT:
0 1 2
0 10 0 13
1 6 1 0
2 5 8 9
3 9 0 1
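The question's desired output uses 1-based row and column labels rather than pandas' default 0-based ones. If that matters, one way (my own sketch, not part of the answer) is to shift the labels before writing:
import pandas as pd

df = pd.DataFrame([trend_end[i+1:i+4] for i in range(0, len(trend_end), 4)])
df.index = df.index + 1      # label rows 1..4 instead of 0..3
df.columns = df.columns + 1  # label columns 1..3 instead of 0..2
df.to_csv('data.txt', sep='\t')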
You should iterate over the list in steps of 4, i.e. df1ana.shape[0] + 1:
steps = df1ana.shape[0] + 1
with open("data.txt", "a") as f:
    f.write(' ' + ' '.join(str(c) for c in range(1, steps)) + '\n')  # write the header line
    for i in range(1, len(trend_end), steps):
        f.write(f"{i//steps + 1:<3}")             # row number (1, 2, 3, ...)
        for j in range(i, i + steps - 1):
            f.write(f"{trend_end[j]:<3}")
        f.write("\n")
The :<3 formatting puts each value in a 3-character, left-aligned field.
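For reference, a quick check of that format spec in a plain Python session (the trailing '|' just makes the padding visible):
>>> f"{7:<3}" + "|"
'7  |'
>>> f"{13:<3}" + "|"
'13 |'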
This should work regardless of the number of groups or the number of records per group. It uses the difference between the size of the full list and the size of the integer-only list to calculate the number of rows, and the ratio of the number of integers to the number of rows to get the number of columns.
import numpy as np
import pandas as pd

digits = [x for x in trend_end if isinstance(x, int)]
pd.DataFrame(np.reshape(digits,
                        (len(trend_end) - len(digits),
                         len(digits) // (len(trend_end) - len(digits))))).to_csv('output.csv')
I have an xlsx file, for example:
A B C D E F G
1 5 2 7 0 1 8
3 4 0 7 8 5 9
4 2 9 7 0 6 2
1 6 3 2 8 8 0
4 3 5 2 5 7 9
5 2 3 2 6 9 1
these being my values (they are actually in an Excel file).
I need to get random rows from it, but separated by the values in column D.
Note that column D has rows with the value 7 and rows with the value 2.
I need to get 1 random row from all the rows that have 7 in column D and 1 random row from all the rows that have 2 in column D.
Then I want to put the results in another xlsx file.
My expected output needs to be the content of line 0, 1 or 2 and the content of line 3, 4 or 5.
Can someone help me with that?
Thanks!
I've created the code for that. The code below assumes that the Excel file is named test.xlsx and resides in the same folder where you run your code. It samples NrandomLines rows for each unique value in column D and prints them out.
import pandas as pd
import numpy as np
import random

df = pd.read_excel('test.xlsx')   # read the excel file
vals = df.D.unique()              # all unique values in column D, in your case only 2 and 7
idx = []
N = []
for i in vals:                        # loop over unique values in column D
    locs = (df.D == i).values.nonzero()[0]
    idx = idx + [locs]                # save the row indices for every unique value in column D
    N = N + [len(locs)]               # save how many rows contain each value in D
NrandomLines = 1                      # how many random samples you want
for i in np.arange(len(vals)):              # loop over unique values of D
    for k in np.arange(NrandomLines):       # loop over how many random samples you want
        randomRow = random.randint(0, N[i]-1)      # pick a random row index
        print(df.iloc[idx[i][randomRow], :])       # print out the random row
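The question also asks to put the sampled rows into another xlsx file. One way to do that (my own sketch, not part of the answer; the output filename is made up) is to draw one sample per group with groupby/apply and write it with to_excel:
import pandas as pd

df = pd.read_excel('test.xlsx')
sampled = df.groupby('D', group_keys=False).apply(lambda g: g.sample(n=1))  # one random row per value in column D
sampled.to_excel('sampled.xlsx', index=False)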
With OpenPyXl, you can use Worksheet.iter_rows to iterate over the worksheet rows.
You can use itertools.groupby to group the rows according to the "D" column value.
To do that, you can create a small function to pick up this value from a row:
def get_d(row):
    return row[3].value
Then, you can use random.choice to choose a row randomly.
Putting it all together, you have:
def get_d(row):
    return row[3].value

for key, group in itertools.groupby(rows, key=get_d):
    row = random.choice(list(group))
    print(row)
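Note that itertools.groupby only groups consecutive items, so the rows need to be ordered by their column-D value (as they already are in the sample data), or sorted first. A more complete sketch with the workbook loading filled in (the filename and the min_row handling are assumptions, not from the answer):
import itertools
import random
from openpyxl import load_workbook

def get_d(row):
    return row[3].value

ws = load_workbook('test.xlsx').active     # first worksheet
rows = list(ws.iter_rows(min_row=2))       # min_row=2 skips the header row; use 1 if there is none
rows.sort(key=get_d)                       # groupby needs the rows grouped/sorted by the key
for key, group in itertools.groupby(rows, key=get_d):
    row = random.choice(list(group))
    print([cell.value for cell in row])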
I have a directory full of data files in the following format:
4 2 5 7
1 4 9 8
8 7 7 1
4 1 4
1 5
2 0
1 0
0 0
0 0
The columns are separated by tabs. The third and fourth columns contain useful information until they reach 'zeroes', at which point they are arbitrarily padded with zeroes until the end of the file.
I want to get the length of the longest column, not counting the 'zero' values at the bottom. In this case the longest column is column 3, with a length of 7, because we disregard the zeros at the bottom. Then I want to transform all the other columns by padding them with zeroes until their length equals the length of my third column (except column 4, because it is already filled with zeroes). Then I want to get rid of all the zeros beyond my max length in all my columns. So my desired file output will be as follows:
4 2 5 7
1 4 9 8
8 7 7 1
0 4 1 4
0 0 1 5
0 0 2 0
0 0 1 0
These files consist of ~100,000 rows each on average, so processing them takes a while, and I can't really find an efficient way of doing this. Because file reading goes line by line, am I right in assuming that in order to find the length of a column we need to process, in the worst case, N rows, where N is the length of the entire file? When I ran a script just to print out all the rows, it took about 10 seconds per file. Also, I'd like to modify the file in place (overwrite it).
Hi, I would use pandas and numpy for this:
import pandas as pd
import numpy as np

df = pd.read_csv('csv.csv', delimiter='\t')
df = df.replace(0, np.nan)
while df.tail(1).isnull().all().all():
    df = df[0:len(df)-1]
df = df.replace(np.nan, 0)
df.to_csv('csv2.csv', sep='\t', index=False)  # I used a different name just for testing
You create a DataFrame from your csv data.
There are a lot of built-in functions that deal with NaN values, so change all 0s to NaN.
Then start at the end with tail(1) and check whether the row is all NaN. If so, copy the DataFrame minus the last row and repeat.
I did this with 100k rows and it takes only a few seconds.
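If the row-by-row while loop ever becomes a bottleneck, a vectorized variant (my own sketch, not part of the answer) finds the last row containing any non-zero value and slices once:
import pandas as pd

df = pd.read_csv('csv.csv', delimiter='\t')
has_data = (df != 0).any(axis=1)      # True for rows with at least one non-zero value
last = has_data[has_data].index[-1]   # index label of the last such row
df = df.loc[:last]                    # .loc slicing is inclusive, so this keeps that row
df.to_csv('csv2.csv', sep='\t', index=False)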
Here are two ways to do it:
# Read in the lines and fill in the zeroes
with open('input.txt') as input_file:
    data = [[item.strip() or '0'
             for item in line.split('\t')]
            for line in input_file]

# Delete lines near the end that are only zeroes
while set(data[-1]) == {'0'}:
    del data[-1]

# Write out the lines
with open('output.txt', 'wt') as output_file:
    output_file.writelines('\t'.join(line) + '\n' for line in data)
Or
with open('input.txt') as input_file:
    with open('output.txt', 'wt') as output_file:
        for line in input_file:
            line = line.split('\t')
            line = [item.strip() or '0' for item in line]
            if all(item == '0' for item in line):
                break
            output_file.write('\t'.join(line))
            output_file.write('\n')
I am iterating through the rows of a pandas DataFrame, expanding each one out into N rows with additional info on each one (for simplicity I've made it a random number here):
from pandas import DataFrame
import pandas as pd
from numpy import random, arange

N = 3
x = DataFrame.from_dict({'farm': ['A', 'B', 'A', 'B'],
                         'fruit': ['apple', 'apple', 'pear', 'pear']})
out = DataFrame()
for i, row in x.iterrows():
    rows = pd.concat([row]*N).reset_index(drop=True)  # requires row to be a DataFrame
    out = out.append(rows.join(DataFrame({'iter': arange(N), 'value': random.uniform(size=N)})))
In this loop, row is a Series object, so the call to pd.concat doesn't work. How do I convert it to a DataFrame? (e.g. what is the difference between x.ix[0:0] and x.ix[0]?)
Thanks!
Given what you commented, I would try
def giveMeSomeRows(group):
    return random.uniform(low=group.low, high=group.high, size=N)

results = x.groupby(['farm', 'fruit']).apply(giveMeSomeRows)
This should give you a separate result dataframe. I have assumed that every farm-fruit combination is unique... there might be other ways if we knew more about your data.
Update
Running code example
def giveMeSomeRows(group):
    return random.uniform(low=group.low, high=group.high, size=N)

N = 3
df = pd.DataFrame(arange(0, 8).reshape(4, 2), columns=['low', 'high'])
df['farm'] = 'a'
df['fruit'] = arange(0, 4)
results = df.groupby(['farm', 'fruit']).apply(giveMeSomeRows)
df
low high farm fruit
0 0 1 a 0
1 2 3 a 1
2 4 5 a 2
3 6 7 a 3
results
farm fruit
a 0 [0.176124290969, 0.459726835079, 0.999564934689]
1 [2.42920143009, 2.37484506501, 2.41474002256]
2 [4.78918572452, 4.25916442343, 4.77440617104]
3 [6.53831891152, 6.23242754976, 6.75141668088]
If instead you want a dataframe, you can update the function to
def giveMeSomeRows(group):
    return pd.DataFrame(random.uniform(low=group.low, high=group.high, size=N))
results
0
farm fruit
a 0 0 0.281088
1 0.020348
2 0.986269
1 0 2.642676
1 2.194996
2 2.650600
2 0 4.545718
1 4.486054
2 4.027336
3 0 6.550892
1 6.363941
2 6.702316
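If a flat table like the original out DataFrame is wanted, the grouped result can be flattened (a small follow-on sketch, not from the answer, assuming results is the DataFrame-returning version shown above):
flat = results.reset_index()              # turn the farm/fruit MultiIndex back into columns
flat = flat.rename(columns={0: 'value'})  # the generated column is labelled 0 by default
print(flat)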