Rearrange data file - Python

I am trying to reorganize a .txt file containing a list of data with traits in the columns and the family members on the rows. Basically, I need to write a program that creates rows comparing the people in each family, so that the traits of persons 1 and 2, 1 and 3, and 2 and 3 are compared, i.e.:
A 1 2 7 8 9 10
A 1 3 7 9 9 11
etc.
where A is the family, the first two numbers identify the people being compared, the 3rd and 4th numbers are trait 1 (e.g. a measurement for each person), and the final numbers are trait 2 (e.g. the BMI value for each person).
My input is like this:
A 1 trait trait
A 2 trait trait
A 3 trait trait
I was able to create a data frame using:
data = pandas.read_csv('family.txt', sep=" ", header=None)
print(data)
I cannot seem to figure out an efficient way to concatenate the data into the rows needed above. Any help is greatly appreciated!
Thank you

OK, consider that your data is as follows:
A 1 7 4 5 6
A 2 6 5 4 7
A 3 7 7 5 4
B 1 7 4 5 6
B 2 6 5 4 7
B 3 7 7 5 4
where the first column is the family, the second column is the person_id, and all subsequent columns are traits.
Some super dirty and hastily written code below seems to give you what you want:
file_lines = []

def read_file():
    global file_lines
    with open("sample.txt", 'r') as fd:
        file_lines = fd.read().splitlines()
    print(file_lines)

def make_output():
    for line1 in file_lines:
        for line2 in file_lines:
            line1c = line1.split(" ")
            line2c = line2.split(" ")
            # only compare people within the same family
            if line1c[0] == line2c[0]:
                # skip self-comparisons and reversed duplicate pairs
                if line1c[1] >= line2c[1]:
                    continue
                out_list = [line1c[0], line1c[1], line2c[1]]
                # interleave each trait of person 1 with the same trait of person 2
                for i in range(2, len(line1c)):
                    out_list.append(line1c[i])
                    out_list.append(line2c[i])
                print(" ".join(out_list))
read_file()
make_output()
With the original continue criterion (skipping only exact self-comparisons, i.e. ==), the output of the print is:
A 1 2 7 6 4 5 5 4 6 7
A 1 3 7 7 4 7 5 5 6 4
A 2 1 6 7 5 4 4 5 7 6
A 2 3 6 7 5 7 4 5 7 4
A 3 1 7 7 7 4 5 5 4 6
A 3 2 7 6 7 5 5 4 4 7
B 1 2 7 6 4 5 5 4 6 7
B 1 3 7 7 4 7 5 5 6 4
B 2 1 6 7 5 4 4 5 7 6
B 2 3 6 7 5 7 4 5 7 4
B 3 1 7 7 7 4 5 5 4 6
B 3 2 7 6 7 5 5 4 4 7
As you can see, in family A, person 1 is compared with 2 and 3, person 2 is compared with 1 and 3, and person 3 is compared with 1 and 2.
Obviously there is duplication, because each person is compared with every other person in the family twice.
It's trivial to remove this by maintaining a list of who has been compared with whom.
P.S.: I know the script is really dirty, but I just wanted to illustrate what I've done, not write production code.
EDIT: I wanted to write a slightly more involved duplicate remover, but since the data is so simple, a small change to the continue criterion (>= instead of ==, as in the code above) solved it. The output after this edit is:
A 1 2 7 6 4 5 5 4 6 7
A 1 3 7 7 4 7 5 5 6 4
A 2 3 6 7 5 7 4 5 7 4
B 1 2 7 6 4 5 5 4 6 7
B 1 3 7 7 4 7 5 5 6 4
B 2 3 6 7 5 7 4 5 7 4
which is free of duplicates.
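Since the question already loads the data with pandas, a more idiomatic route is to generate the person pairs within each family using itertools.combinations. A minimal sketch, assuming the same whitespace-separated layout as the sample above (the file name family.txt comes from the question):
import itertools
import pandas as pd

# columns: 0 = family, 1 = person id, 2+ = traits
data = pd.read_csv('family.txt', sep=' ', header=None)

rows = []
for family, group in data.groupby(0):
    people = group.values.tolist()
    # every unordered pair of people within the family
    for p1, p2 in itertools.combinations(people, 2):
        row = [family, p1[1], p2[1]]
        # interleave each trait of person 1 with the same trait of person 2
        for t1, t2 in zip(p1[2:], p2[2:]):
            row.extend([t1, t2])
        rows.append(row)

result = pd.DataFrame(rows)
print(result.to_string(index=False, header=False))
Because combinations() yields each unordered pair exactly once, no separate duplicate removal is needed.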

Related

Calculate count of a column based on another column in a Python dataframe

I have a dataframe like the one below, showing each patient's stay in the ICU (in hours) in the ICULOS column.
df # Main dataframe
dfy = df.copy()
dfy
P_ID  ICULOS  Count
1     1       5
1     2       5
1     3       5
1     4       5
1     5       5
2     1       9
2     2       9
2     3       9
2     4       9
2     5       9
2     6       9
2     7       9
2     8       9
2     9       9
3     1       3
3     2       3
3     3       3
4     1       7
4     2       7
4     3       7
4     4       7
4     5       7
4     6       7
4     7       7
I calculated each patient's ICULOS count and placed it in the new column named Count using:
dfy['Count'] = dfy.groupby(['P_ID'])['ICULOS'].transform('count')
Now I want to remove, based on P_ID, those patients whose Count is less than 8 (note: I want to remove the whole patient record). After removing the patients with Count < 8, only P_ID = 2 will remain, since its count is 9.
The desired output:
P_ID  ICULOS  Count
2     1       9
2     2       9
2     3       9
2     4       9
2     5       9
2     6       9
2     7       9
2     8       9
2     9       9
I tried the following code, but for some reason it is not working for me. It did work when I first wrote it, but when I re-ran it after a few days it gave me 0 results. Can someone suggest better code? Thanks.
dfy = dfy.drop_duplicates(subset=['P_ID'],keep='first')
lis1 = dfy['P_ID'].tolist()
Icu_less_8 = dfy.loc[dfy['Count'] < 8]
lis2 = Icu_less_8.P_ID.to_list()
lis_3 = [k for k in tqdm_notebook(lis1) if k not in lis2]
# removing those patients who have ICULOS of less than 8 hours
df_1 = pd.DataFrame()
for l in tqdm_notebook(lis_3, desc='Progress'):
    df_1 = df_1.append(df.loc[df['P_ID']==l])
You can filter the rows directly on the transformed count using Series.ge:
In [1521]: dfy[dfy.groupby(['P_ID'])['ICULOS'].transform('count').ge(8)]
Out[1521]:
P_ID ICULOS Count
5 2 1 9
6 2 2 9
7 2 3 9
8 2 4 9
9 2 5 9
10 2 6 9
11 2 7 9
12 2 8 9
13 2 9 9
EDIT after OP's comment: For multiple conditions, do:
In [1533]: x = dfy.groupby(['P_ID'])['ICULOS'].transform('count')
In [1539]: dfy.loc[x[x.ge(8) & x.le(72)].index]
Out[1539]:
P_ID ICULOS Count
5 2 1 9
6 2 2 9
7 2 3 9
8 2 4 9
9 2 5 9
10 2 6 9
11 2 7 9
12 2 8 9
13 2 9 9
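An alternative worth knowing (a sketch using the same dfy as above) is GroupBy.filter, which keeps only the groups that satisfy a predicate. With many patients, the transform approach above is usually faster, because filter calls the predicate once per group:
# keep whole patient records with at least 8 ICULOS rows
out = dfy.groupby('P_ID').filter(lambda g: len(g) >= 8)

# with an upper bound as well, as in the multi-condition case
out = dfy.groupby('P_ID').filter(lambda g: 8 <= len(g) <= 72)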

Efficiently calculate co-occurrence matrix

I have a large dataframe (5 rows x 92,579 columns) in the following format:
1 2 3 4 5 6 7 8 9 10 11 ... 92569 92570 92571 92572 92573 92574 92575 92576 92577 92578 92579
0 10 9 8 5 5 10 1 1 6 2 3 ... 9 1 8 3 2 5 5 5 2 2 8
1 3 1 7 4 4 3 8 8 3 6 7 ... 1 8 7 5 6 4 4 4 2 6 7
2 6 4 2 9 7 6 5 5 6 7 2 ... 4 5 2 6 6 9 5 9 3 10 2
3 3 8 4 4 7 3 1 1 3 7 6 ... 8 1 5 7 2 4 1 4 6 10 2
4 4 6 5 5 5 4 1 1 4 8 10 ... 6 1 7 3 6 5 5 5 8 2 9
Each of the entries ranges from 1 to 10 (representing an assignment to one of 10 clusters).
I want to create a 92579 x 92579 matrix that represents how many times (i.e. in how many rows) the variables in columns i and j have the same value. For example, variables 4 and 5 have the same value in 3 rows, so entries (4, 5) and (5, 4) of the co-occurrence matrix should be 3.
I only need the upper triangular portion of the desired matrix (since it will be symmetric).
I've looked at similar questions here, but they don't address both of these issues:
How to do this efficiently for a very large matrix
How to do this for non-binary entries
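One standard vectorized approach, offered here as a sketch rather than a solution tested at full scale: one-hot encode each cluster label. For each label k, the product of the indicator matrix (X == k) with its transpose counts the rows where columns i and j both equal k, and summing over the 10 labels counts the rows where the columns agree at all. Note the full 92579 x 92579 result is roughly 68 GB as int64; with only 5 rows the counts fit in int8 (roughly 8.6 GB), or the matrix can be computed block-wise.
import numpy as np

# toy stand-in for the 5 x 92579 frame: 5 rows, 8 columns, labels 1..10
rng = np.random.default_rng(0)
X = rng.integers(1, 11, size=(5, 8))

n_cols = X.shape[1]
co = np.zeros((n_cols, n_cols), dtype=np.int64)
for k in range(1, 11):
    ind = (X == k).astype(np.int64)  # indicator matrix for label k
    co += ind.T @ ind                # rows where columns i and j both equal k

# only the upper triangle is needed (the matrix is symmetric)
upper = np.triu(co, k=1)
print(upper)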

How to split CSV file data in batches?

I have a CSV file whose number of lines is a multiple of 16.
After reading it, I want to iterate over and inspect each batch of 16 rows.
For example, the following file has 6 lines, a multiple of 2:
1 2 4
4 5 6
4 5 7
3 4 7
6 7 1
3 1 8
then I want to divide these lines into 3 tables:
1 2 4
4 5 6

4 5 7
3 4 7

6 7 1
3 1 8
and iterate over each of these individual tables.
Thanks a lot
There are a lot of ways you can do this. One way is to use numpy to create the groupings and then use groupby to perform the iteration.
print(df)
a b c
0 1 2 4
1 4 5 6
2 4 5 7
3 3 4 7
4 6 7 1
5 3 1 8
groups = np.arange(len(df)) // 2
for idx, subset in df.groupby(groups):
    print(subset)
    print("-" * 10)
# prints:
a b c
0 1 2 4
1 4 5 6
----------
a b c
2 4 5 7
3 3 4 7
----------
a b c
4 6 7 1
5 3 1 8
----------
If you don't need the entire dataset at once and just want a specific number of rows at a time, consider reading the CSV file in chunks rather than reading all of it up front.
Something like this will work:
import pandas as pd

fileName = 'sample.csv'
batchSize = 16
for df in pd.read_csv(fileName, chunksize=batchSize):
    # process the chunk here, e.g.
    print(df)
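If the whole file fits in memory anyway, a plain iloc slice loop is another simple option. A sketch, assuming (as in the question) that the row count is an exact multiple of the batch size:
import pandas as pd

df = pd.read_csv('sample.csv')
batchSize = 16

# step through the frame 16 consecutive rows at a time
for start in range(0, len(df), batchSize):
    batch = df.iloc[start:start + batchSize]
    print(batch)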

Bundle columns of a DataFrame into hierarchical index

If I have preexisting columns (say 12 columns, all with unique names) and I want to organize them under two "header" columns, such as 8 assigned to Detail and 4 assigned to Summary, what is the most effective approach besides sorting them, manually creating a new index, and then swapping out the indices?
Happy to provide more example detail, but that's the gist of what is a pretty generic problem.
You need to use the multi-index columns capability. It's important to rename() the columns before reindex() so that no data is lost.
import random
import pandas as pd

df = pd.DataFrame({f"col-{i}": [random.randint(1, 10) for _ in range(10)] for i in range(12)})
header = [f"col-{i}" for i in range(8)]

# build a multi-index: the first 8 columns go under "Header", the rest under "Detail"
mi = pd.MultiIndex.from_tuples([("Header" if c in header else "Detail", c)
                                for c in df.columns], names=('Category', 'Name'))
# rename before reindex to prevent data loss
df = df.rename(columns={c: mi[i] for i, c in enumerate(df.columns)}).reindex(columns=mi)
print(df.to_string())
print(df.to_string())
output
Category Header Detail
Name col-0 col-1 col-2 col-3 col-4 col-5 col-6 col-7 col-8 col-9 col-10 col-11
0 5 5 6 1 8 3 8 6 8 2 8 10
1 2 7 10 5 2 10 5 10 10 7 6 1
2 10 1 1 2 7 9 2 9 4 4 7 6
3 8 10 1 3 3 4 10 10 9 7 6 8
4 6 8 7 2 5 4 3 3 7 9 8 6
5 6 4 4 4 1 5 8 4 4 1 6 8
6 3 7 3 8 8 4 6 1 5 10 5 10
7 5 1 10 9 9 7 8 2 6 7 10 4
8 2 2 1 4 8 8 7 2 5 9 9 9
9 8 6 5 6 2 8 2 8 10 7 9 3
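An alternative sketch that avoids the rename/reindex dance: slice the frame into the two groups and let pd.concat build the top level of the column MultiIndex from the keys argument (the Header/Detail split is the same one assumed above):
import random
import pandas as pd

df = pd.DataFrame({f"col-{i}": [random.randint(1, 10) for _ in range(10)] for i in range(12)})
header = [f"col-{i}" for i in range(8)]
detail = [c for c in df.columns if c not in header]

# the keys become the top level of the column MultiIndex
df2 = pd.concat([df[header], df[detail]], axis=1, keys=['Header', 'Detail'])
df2.columns.names = ['Category', 'Name']
print(df2.to_string())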

Nested loop results bunched together - Python

for j in range(10):
    for i in range(10):
        print(j, end=" ")
My results are bunched together and I need to have 10 numbers per line. I can't use a print("0123456789"). I have tried print(j,j,j,j,j,j,j,j,j) and I get the results I'm looking for, but I'm sure this isn't the proper way to write the code.
If print(j,j,j,j,j,j,j,j,j) works, then you simply need to add another print() after each iteration of the outer loop:
for j in range(10):
    for i in range(10):
        print(j, end=" ")
    print()
Output:
0 0 0 0 0 0 0 0 0 0
1 1 1 1 1 1 1 1 1 1
2 2 2 2 2 2 2 2 2 2
3 3 3 3 3 3 3 3 3 3
4 4 4 4 4 4 4 4 4 4
5 5 5 5 5 5 5 5 5 5
6 6 6 6 6 6 6 6 6 6
7 7 7 7 7 7 7 7 7 7
8 8 8 8 8 8 8 8 8 8
9 9 9 9 9 9 9 9 9 9
Or simply:
for j in range(10):
    print(" ".join(str(j) * 10))
0 0 0 0 0 0 0 0 0 0
1 1 1 1 1 1 1 1 1 1
2 2 2 2 2 2 2 2 2 2
3 3 3 3 3 3 3 3 3 3
4 4 4 4 4 4 4 4 4 4
5 5 5 5 5 5 5 5 5 5
6 6 6 6 6 6 6 6 6 6
7 7 7 7 7 7 7 7 7 7
8 8 8 8 8 8 8 8 8 8
9 9 9 9 9 9 9 9 9 9
Why use a nested for loop when you can use a single for loop?
for i in range(10):
    print('{} '.format(i) * 10)
This is similar to Malik Brahimi's solution, except it doesn't put a space after the last digit on each line:
for i in range(10):
    print(' '.join([str(i)]*10))
output
0 0 0 0 0 0 0 0 0 0
1 1 1 1 1 1 1 1 1 1
2 2 2 2 2 2 2 2 2 2
3 3 3 3 3 3 3 3 3 3
4 4 4 4 4 4 4 4 4 4
5 5 5 5 5 5 5 5 5 5
6 6 6 6 6 6 6 6 6 6
7 7 7 7 7 7 7 7 7 7
8 8 8 8 8 8 8 8 8 8
9 9 9 9 9 9 9 9 9 9
Just for fun, here's another way to do it with a single loop, this time using a format string with numbered fields.
fmt = ('{0} ' * 10)[:-1]
for i in range(10):
    print(fmt.format(i))
