How do I insert missing rows and conduct math between multiple files? - python

I have two files that contain two columns each. The first column is an integer. The second column is a linear coordinate. Not every coordinate is represented, and I would like to insert all coordinates that are missing. Below is an example from one file of my data:
3 0
1 10
1 100
2 1000
1 1000002
1 1000005
1 1000006
For this example, coordinates 1-9, 11-99, etc are missing but need to be inserted, and need to be given a count of zero (0).
3 0
0 1
0 2
0 3
0 4
0 5
0 6
0 7
0 8
0 9
1 10
........
With the full set of rows, I then need to add one (1) to every count (the first column). Finally, I would like to do a simple calculation (the ratio) between the corresponding rows of the first column in the two files. The ratio should be a real number.
I'd like to be able to do this with Unix if possible, but am somewhat familiar with python scripting as well. Any help is greatly appreciated.

This should work with Python 2.3 onwards.
I assumed that your file is space delimited.
If you want values past 1000006, you will need to change the value of desired_range.
import csv

desired_range = 1000007

reader = csv.reader(open('fill_range_data.txt'), delimiter=' ')

data_map = dict()
for row in reader:
    frequency = int(row[0])
    value = int(row[1])
    data_map[value] = frequency

for i in range(desired_range):
    if i in data_map:
        print data_map[i], i
    else:
        print 0, i
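The code above only fills in the missing coordinates for one file. As a rough sketch of the remaining steps from the question (adding one to every count and taking the row-wise ratio between the two files), something like the following could work in Python 3; the filenames are placeholders and the space-delimited layout is assumed, as in the answer above.

import csv

def load_counts(path, desired_range=1000007):
    """Return a list of counts indexed by coordinate, with missing coordinates as 0."""
    data_map = {}
    with open(path) as f:
        for frequency, value in csv.reader(f, delimiter=' '):
            data_map[int(value)] = int(frequency)
    return [data_map.get(i, 0) for i in range(desired_range)]

# hypothetical filenames for the two input files
counts_a = load_counts('file_a.txt')
counts_b = load_counts('file_b.txt')

for coord, (a, b) in enumerate(zip(counts_a, counts_b)):
    # add one to every count, then take the ratio as a real number
    print(coord, (a + 1) / (b + 1))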

Related

Pandas - Equal occurrences of unique type for a column

I have a Pandas DF called "DF". I would like to sample data from the population in such a way that, given an occurrence count, N = 100, and column = "Type", I would like to print a total of 100 rows from that column such that the distribution of occurrences of each type is equal.
SNo  Type      Difficulty
1    Single    5
2    Single    15
3    Single    4
4    Multiple  2
5    Multiple  14
6    None      7
7    None      4323
For instance, if I specify N = 3, the output must be:
SNo  Type      Difficulty
1    Single    5
3    Multiple  4
6    None      7
If, for a given N, the occurrences of certain types do not meet the minimum split, I can randomly increase another type's count.
I am wondering on how to approach this programmatically. Thanks!
Use groupby.sample (pandas ≥ 1.1) with N divided by the number of types.
NB. This assumes N is a multiple of the number of types if you want strict equality.
N = 3
N2 = N//df['Type'].nunique()
out = df.groupby('Type').sample(n=N2)
Handling N that is not a multiple of the number of types:
Use the same as above and complete to N with random rows excluding those already selected.
N = 5
N2, R = divmod(N, df['Type'].nunique())
out = df.groupby('Type').sample(n=N2)
out = pd.concat([out, df.drop(out.index).sample(n=R)])
As there is still a chance that you complete with items of the same group, if you really want to ensure sampling from different groups replace the last step with:
out = pd.concat([out, df.drop(out.index).groupby('Type').sample(n=1).sample(n=R)])
Example output:
SNo Type Difficulty
4 5 Multiple 14
6 7 None 4323
2 3 Single 4
3 4 Multiple 2
5 6 None 14
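As a quick sanity check (not part of the answer above), you could count the sampled types to confirm the split is equal:

print(out['Type'].value_counts())
# For N = 3 each type should appear exactly once (order may vary):
# Single      1
# Multiple    1
# None        1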

How to write a list to a file with a specific format?

I have a Python list and want to write it out in a specific format.
input:
trend_end= ['skill1',10,0,13,'skill2',6,1,0,'skill3',5,8,9,'skill4',9,0,1]
I want to write a file like this:
output:
1 2 3
1 10 0 13
2 6 1 0
3 5 8 9
4 9 0 1
Basically, I need to do the following steps:
Separate the elements of the list for each skill.
Write them in a table shape, adding row and column indices.
I want to use it as input to another piece of software, which is why I need to write a file.
I did this but I know it is wrong, can you see how I can fix it?
f1 = open("data.txt", "a")
for j in trend_end:
    f1.write(str(j))
for i in range(1, int(len(trend_end)/df1ana.shape[0])):
    G = [trend_end[i*(df1ana.shape[0]-10) - (df1ana.shape[0]-10):i*(df1ana.shape[0]-10)]]
    for h in G:
        f1.write(i)
        f1.write(h)
        f1.write('\n')
f.close()
df1ana.shape[0] is 3 in the above example. It is basically the length of the data for each skill.
Another option that you can try via pandas:
import pandas as pd
pd.DataFrame([trend_end[i+1:i+4] for i in range(0,len(trend_end),4)]).to_csv('data.txt', sep='\t')
OUTPUT:
0 1 2
0 10 0 13
1 6 1 0
2 5 8 9
3 9 0 1
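If you need the 1-based row and column labels from your desired output rather than pandas' default 0-based ones, one possible variation on the same idea (a sketch, not tested against your real data) is:

import pandas as pd

trend_end = ['skill1', 10, 0, 13, 'skill2', 6, 1, 0,
             'skill3', 5, 8, 9, 'skill4', 9, 0, 1]

# group every 4 items and drop the skill name, as in the answer above
df = pd.DataFrame([trend_end[i+1:i+4] for i in range(0, len(trend_end), 4)])

# shift both axes so the labels start at 1 instead of 0
df.index = df.index + 1
df.columns = df.columns + 1

df.to_csv('data.txt', sep='\t')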
You should iterate over the list in steps of 4, i.e. df1ana.shape[0]+1
steps = df1ana.shape[0] + 1
with open("data.txt", "a") as f:
    f.write('   ' + ''.join(f"{c:<3}" for c in range(1, steps)) + '\n')  # header line with column indices
    for i in range(1, len(trend_end), steps):
        f.write(f"{i // steps + 1:<3}")           # row index: 1, 2, 3, ...
        for j in range(i, i + steps - 1):
            f.write(f"{trend_end[j]:<3}")
        f.write("\n")
The :<3 formatting puts each value in a 3-character, left-aligned field.
This should work regardless of the number of groups or the number of records per group. It uses the difference in the size of the full list compared to the integer only list to calculate the number of rows you should have, and uses the ratio of the number of integers over the number of rows to get the number of columns.
import numpy as np
import pandas as pd

digits = [x for x in trend_end if isinstance(x, int)]
n_rows = len(trend_end) - len(digits)   # one row per skill name
n_cols = len(digits) // n_rows          # integers per skill
pd.DataFrame(np.reshape(digits, (n_rows, n_cols))).to_csv('output.csv')

Using 2D Array to track value across rows and columns to the first occurrence of the value?

I searched around for something like this but couldn't find anything. Also struggled with the wording of my question so bear with me here. I'm working with data provided in an excel file and have excel/VBA and python available to use. I'm still learning both and I just need a push in the right direction on what method to use so I can work through it and learn.
Say I have a series of 5 processes in a manufacturing facility, each process represented by a column. Occasionally a downstream process (column 5) gets backed up and slows down upstream processes (column 4, then 3, etc.). I have a 2D array that indicates running (0) or backed up (1). Each column is an individual process and each row is a time interval. The array will be the same size every time (10000+ rows, 5 columns), but the values change. Example:
MyArray = [0 0 0 0 0
0 1 0 0 0
1 1 0 0 1
1 0 0 0 1
0 0 0 1 1
0 0 1 1 1
0 0 1 1 0
0 1 1 0 0
0 1 0 0 0
0 0 0 0 0]
Essentially when there is a value of 1, I want to trace it to the upper-rightmost adjacent 1. So for column 1, rows 2 & 3 would be traced back to column 2, row 2. For column 2, rows 8 & 9 would be traced back to column 5, row 3. Currently I just have an if statement looking to the right within the same row, and it's better than nothing, but it does not capture the cascade effect you get when something backs up multiple upstream processes.
I could figure out a way to do it by looping a search to look up in a column until it finds a zero, then if the next column value is a 1, search up until finds a zero, repeat. But this seems really inefficient and I feel like there has to be a better way to do it. Any kinds of thoughts or comments would be very helpful. Thanks.
One tip is to actually begin from the bottom row (let's call this row r=0) and first column c=1 and search rightward until you hit column c=5. If no 1s are found then repeat the search for the second-to-bottom row, and continue until you search the top row.
Whenever a 1 is found at some index (r, c), recursively ask the question "does this element have all zeros above (r+1, c), to the right (r, c+1), and above and to the right (r+1, c+1)?" Now, I'm not sure what you do in the case of a "tie", i.e. when there is a 1 above, at (r+1, c), and another 1 to the right, at (r, c+1), but a 0 at (r+1, c+1); you could address this competition in a simple if statement.
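A minimal Python sketch of that trace, working directly on the array as laid out above (row 0 at the top, so "up" means an earlier time step); the preference order up-right, then right, then up is an assumption, since the tie case is left open above:

def trace_source(grid, r, c):
    """Follow connected 1s up and to the right from (r, c).

    Returns the (row, col) of the upper-rightmost 1 reachable by stepping
    up, right, or diagonally up-right. The preference order is an assumption.
    """
    n_cols = len(grid[0])
    while True:
        up = r > 0 and grid[r - 1][c] == 1
        right = c < n_cols - 1 and grid[r][c + 1] == 1
        diag = r > 0 and c < n_cols - 1 and grid[r - 1][c + 1] == 1
        if diag:
            r, c = r - 1, c + 1
        elif right:
            c += 1
        elif up:
            r -= 1
        else:
            return r, c

MyArray = [[0, 0, 0, 0, 0],
           [0, 1, 0, 0, 0],
           [1, 1, 0, 0, 1],
           [1, 0, 0, 0, 1],
           [0, 0, 0, 1, 1],
           [0, 0, 1, 1, 1],
           [0, 0, 1, 1, 0],
           [0, 1, 1, 0, 0],
           [0, 1, 0, 0, 0],
           [0, 0, 0, 0, 0]]

print(trace_source(MyArray, 2, 0))  # (1, 1): the column-1 backup traces to row 2, column 2
print(trace_source(MyArray, 7, 1))  # (2, 4): the column-2 backup traces to row 3, column 5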

Python csv; get max length of all columns then lengthen all other columns to that length

I have a directory full of data files in the following format:
4 2 5 7
1 4 9 8
8 7 7 1
4 1 4
1 5
2 0
1 0
0 0
0 0
They are separated by tabs. The third and fourth columns contain useful information until they reach 'zeroes', at which point they are arbitrarily filled with zeroes until the end of the file.
I want to get the length of the longest column, not counting the 'zero' values at the bottom. In this case, the longest column is column 3 with a length of 7, because we disregard the zeros at the bottom. Then I want to transform all the other columns by packing zeroes onto them until their length is equal to the length of my third column (besides column 4, because it is already filled with zeroes). Then I want to get rid of all the zeros beyond my max length in all my columns. So my desired file output will be as follows:
4 2 5 7
1 4 9 8
8 7 7 1
0 4 1 4
0 0 1 5
0 0 2 0
0 0 1 0
These files consist of ~ 100,000 rows each on average... So processing them takes a while. Can't really find an efficient way of doing this. Because of the way file-reading goes (line-by-line), am I right in assuming that in order to find the length of a column, we need to process in the worst case, N rows? Where N is the length of the entire file. When I just ran a script to print out all the rows, it took about 10 seconds per file... Also, I'd like to modify the file in-place (over-write).
Hi I would use Pandas and Numpy for this:
import pandas as pd
import numpy as np

df = pd.read_csv('csv.csv', delimiter='\t')
df = df.replace(0, np.nan)
while df.tail(1).isnull().all().all() == True:
    df = df[0:len(df)-1]
df = df.replace(np.nan, 0)
df.to_csv('csv2.csv', sep='\t', index=False)  # I used a different name just for testing
You create a DataFrame with your csv data.
There are a lot of built in functions that deal with NaN values, so change all 0s to nan.
Then start at the end tail(1) and check if the row is all() NaN. If so copy the DF less the last row and repeat.
I did this with 100k rows and it takes only a few seconds.
Here are two ways to do it:
# Read in the lines and fill in the zeroes
with open('input.txt') as input_file:
    data = [[item.strip() or '0'
             for item in line.split('\t')]
            for line in input_file]

# Delete lines near the end that are only zeroes
while set(data[-1]) == {'0'}:
    del data[-1]

# Write out the lines
with open('output.txt', 'wt') as output_file:
    output_file.writelines('\t'.join(line) + '\n' for line in data)
Or
with open('input.txt') as input_file:
    with open('output.txt', 'wt') as output_file:
        for line in input_file:
            line = line.split('\t')
            line = [item.strip() or '0' for item in line]
            if all(item == '0' for item in line):
                break
            output_file.write('\t'.join(line))
            output_file.write('\n')

Counting the weighted intersection of equivalent rows in two tables

The following question is a generalization to the question posted here:
Counting the intersection of equivalent rows in two tables
I have two FITS files. For example, the first file has 100 rows and 2 columns. The second file has 1000 rows and 3 columns.
FITS FILE 1        FITS FILE 2
A  B               C  D  E
1  2               1  2  0.1
1  3               1  2  0.3
2  4               1  2  0.9
I need to take the first row of the first file, i.e. 1 and 2, and check how many rows in the second file have C = 1 and D = 2, weighting each pair (C,D) with respect to the corresponding value in column E.
In the example, I have 3 rows in the second file that have C = 1 and D = 2. They have weights E = 0.1, 0.3, and 0.9, respectively. Weighting with respect to the values in E, I need to associate the value 0.1+0.3+0.9 = 1.3 to the pair (A,B) = (1,2) of the first file. Then, I need to do the same for the second row (first file), i.e. 1 and 3, and find out how many rows in the second file have 1 and 3, again weighting with respect to the value in column E, and so on.
The first file does not have duplicates (all the rows have different pairs, none are identical, only file 2 has many identical pairs which I need to find).
I finally need the weighted numbers of rows in the second file that have the similar values as that of the rows of the first FITS file.
The result should be:
A B Number
1 2 1.3 # 1 and 2 occurs 1.3 times
1 3 4.5 # 1 and 3 occurs 4.5 times
and so on for all pairs in A and B columns.
I know from the post cited above that the solution for weights in column E all equal to 1 involves Counter, as follows:
from collections import Counter

# Create frequency table of (C,D) column pairs
file2freq = Counter(zip(C,D))

# Look up frequency value for each row of file 1
for a,b in zip(A,B):
    # and print out the row and frequency data.
    print a,b,file2freq[a,b]
To answer the question I need to include the weights in E when I use Counter:
file2freq = Counter(zip(C,D))
I was wondering if it is possible to do that.
Thank you very much for your help!
I'd follow up on the suggestion made by Iguananaut in the comments to that question. I believe numpy is an ideal tool for this.
import numpy as np

fits1 = np.genfromtxt('fits1.csv')
fits2 = np.genfromtxt('fits2.csv')

summed = np.zeros(fits1.shape[0])
for ind, row in enumerate(fits1):
    condition = (fits2[:, :2] == row).all(axis=1)
    # change the assignment operator to += if the rows in fits1 are not unique
    summed[ind] = fits2[condition, -1].sum()
After the import, the first 2 lines will load the data from the files. That will return an array of floats, which comes with the warning: comparing one float to another is prone to bugs. In this case it will work though, because both the columns in fits1.csv and the first 2 columns in fits2.csv are integers and parsed in the same manner by genfromtxt.
Then, in the for-loop the variable condition is created, which states that anytime the first two columns in fits2 match with the columns of the current row of fits1, it is to be taken into account (the result is a boolean array).
Then, finally, for the current row index ind, set the value of the array summed to the sum of all the values in column 3 of fits2, where the condition was True.
For a mini example I made, I got this:
oliver@armstrong:/tmp/sto$ cat fits1.csv
1 2
1 3
2 4
oliver@armstrong:/tmp/sto$ cat fits2.csv
1 2 .1
1 2 .3
1 2 .9
2 4 .3
1 5 .5
2 4 .7
# run the above code:
# summed is:
# array([ 1.3, 0. , 1. ])
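If you would rather stay close to the Counter approach quoted in the question, a short sketch that swaps the Counter for a defaultdict accumulating the weights in E (assuming A, B, C, D, E are the column arrays from the question, and using Python 3 print) would be:

from collections import defaultdict

# Accumulate the E weights for each (C, D) pair instead of counting occurrences
file2weight = defaultdict(float)
for c, d, e in zip(C, D, E):
    file2weight[(c, d)] += e

# Look up the weighted count for each row of file 1
for a, b in zip(A, B):
    print(a, b, file2weight[(a, b)])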
