Rearrange data in csv with Python - python

I have a .csv file with the following format:
A B C D E F
X1 X2 X3 X4 X5 X6
Y1 Y2 Y3 Y4 Y5 Y6
Z1 Z2 Z3 Z4 Z5 Z6
What I want:
A X1
B X2
C X3
D X4
E X5
F X6
A Y1
B Y2
C Y3
D Y4
E Y5
F Y6
A Z1
B Z2
C Z3
D Z4
E Z5
F Z6
I am unable to wrap my mind around the built-in transpose functions in order to achieve the final result. Any help would be appreciated.

You can melt your dataframe using pandas. Note that melt groups the result by column (all A values first), not row by row as in the question:
import pandas as pd
df = pd.read_csv(csv_filename)
>>> pd.melt(df)
variable value
0 A X1
1 A Y1
2 A Z1
3 B X2
4 B Y2
5 B Z2
6 C X3
7 C Y3
8 C Z3
9 D X4
10 D Y4
11 D Z4
12 E X5
13 E Y5
14 E Z5
15 F X6
16 F Y6
17 F Z6
A pure python solution would be as follows:
file_out_delimiter = ','  # Use '\t' for tab delimited.
with open(filename, 'r') as f, open(filename_out, 'w') as f_out:
    headers = f.readline().split()
    for row in f:
        for pair in zip(headers, row.split()):
            f_out.write(file_out_delimiter.join(pair) + '\n')
resulting in the following file contents:
A,X1
B,X2
C,X3
D,X4
E,X5
F,X6
A,Y1
B,Y2
C,Y3
D,Y4
E,Y5
F,Y6
A,Z1
B,Z2
C,Z3
D,Z4
E,Z5
F,Z6
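If pandas is preferred but the row-by-row order from the question matters, one option is stack instead of melt. This is only a minimal sketch, assuming a whitespace-separated file; the inline sample text and the name long_df are made up for illustration:

```python
import io
import pandas as pd

# Hypothetical inline sample standing in for the real file.
sample = "A B C\nX1 X2 X3\nY1 Y2 Y3\n"
df = pd.read_csv(io.StringIO(sample), sep=r"\s+")

# stack() walks the frame row by row, pairing each header with its cell.
long_df = (
    df.stack()
      .reset_index(level=0, drop=True)   # drop the original row number
      .rename_axis("variable")           # name the header level
      .reset_index(name="value")         # back to a two-column DataFrame
)
print(long_df.to_string(index=False, header=False))
```

The rows come out as A, B, C for the first data line, then A, B, C for the second, which matches the layout asked for.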

Related

how to solve pandas multi-column explode issue?

I am trying to explode multi-columns at a time systematically.
Such that:
and I want the final output as:
I tried
df=df.explode('sauce', 'meal')
but this explodes only the first column (sauce in this case); the second one is not exploded.
I also tried:
df=df.explode(['sauce', 'meal'])
but this code raises the error
ValueError: column must be a scalar
I tried this approach, and also this. None worked.
Note: I cannot use the index, because there are some non-unique values in the fruits column.
Prior to pandas 1.3.0 use:
df.set_index(['fruits', 'veggies'])[['sauce', 'meal']].apply(pd.Series.explode).reset_index()
Output:
fruits veggies sauce meal
0 x1 y2 a d
1 x1 y2 b e
2 x1 y2 c f
3 x2 y2 g k
4 x2 y2 h l
Many columns? Try:
df.set_index(df.columns.difference(['sauce', 'meal']).tolist())\
.apply(pd.Series.explode).reset_index()
Output:
fruits veggies sauce meal
0 x1 y2 a d
1 x1 y2 b e
2 x1 y2 c f
3 x2 y2 g k
4 x2 y2 h l
Update your version of Pandas
# Setup
import pandas as pd
df = pd.DataFrame({'fruits': ['x1', 'x2'],
                   'veggies': ['y1', 'y2'],
                   'sauce': [list('abc'), list('gh')],
                   'meal': [list('def'), list('kl')]})
print(df)
# Output
fruits veggies sauce meal
0 x1 y1 [a, b, c] [d, e, f]
1 x2 y2 [g, h] [k, l]
Explode (Pandas 1.3.5):
out = df.explode(['sauce', 'meal'])
print(out)
# Output
fruits veggies sauce meal
0 x1 y1 a d
0 x1 y1 b e
0 x1 y1 c f
1 x2 y2 g k
1 x2 y2 h l

How to create a distance matrix between two places

I have a dataframe that looks like this
origin Destination distance
x1 y1 d11
x2 y1 d21
x3 y1 d31
x1 y2 d12
x2 y2 d22
x3 y2 d32
x1 y3 d13
x2 y3 d23
x3 y3 d33
How do I get the output as a matrix?
x1 x2 x3
y1 d11 d21 d31
y2 d12 d22 d32
y3 d13 d23 d33
Also I want the output unsorted.
Have you looked into pivot tables? With Destination as the rows and origin as the columns, as in your desired output, this would look like
df.pivot(index='Destination', columns='origin', values='distance')
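Regarding the "unsorted" requirement: as far as I know, pivot returns its row and column labels in sorted order, so the original order has to be restored afterwards. A minimal sketch, assuming the column names from the question and made-up data that is deliberately not alphabetical:

```python
import pandas as pd

# Hypothetical data, deliberately not in alphabetical order.
df = pd.DataFrame({
    "origin":      ["x2", "x1", "x2", "x1"],
    "Destination": ["y2", "y2", "y1", "y1"],
    "distance":    ["d22", "d12", "d21", "d11"],
})

# Destinations become the rows, origins the columns.
matrix = df.pivot(index="Destination", columns="origin", values="distance")

# Reindex with the labels in order of first appearance to undo the sorting.
matrix = matrix.reindex(index=df["Destination"].unique(),
                        columns=df["origin"].unique())
print(matrix)
```

Note that pivot raises an error if the same origin/Destination pair occurs more than once; pivot_table with an aggregation function is the usual choice in that case.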

Create a new dataframe with k copies of each row appended to itself

Suppose I have a dataframe with n rows:
Index data1 data2 data3
0 x0 x0 x0
1 x1 x1 x1
2 x2 x2 x2
...
n xn xn xn
How do I create a new dataframe (using pandas) with k copies of each row appended to itself:
Index data1 data2 data3
0 x0 x0 x0
1 x0 x0 x0
...
k-1 x0 x0 x0
k x1 x1 x1
k+1 x1 x1 x1
...
2k-1 x1 x1 x1
2k x2 x2 x2
...
First concat, then sort
The method I'd use is to create a list of duplicate dataframes, concat them together, and then sort_index:
count = 5
new_df = pd.concat([df]*count).sort_index()
Using numpy.repeat and .iloc. Here, k=3:
df.iloc[np.repeat(np.arange(len(df)), 3)]
Out[256]:
Index data1 data2 data3
0 0 x0 x0 x0
0 0 x0 x0 x0
0 0 x0 x0 x0
1 1 x1 x1 x1
1 1 x1 x1 x1
1 1 x1 x1 x1
2 2 x2 x2 x2
2 2 x2 x2 x2
2 2 x2 x2 x2
Option 1
Use repeat + reindex + reset_index:
df
data1 data2 data3
0 x0 x0 x0
1 x1 x1 x1
2 x2 x2 x2
df.reindex(df.index.repeat(5)).reset_index(drop=True)
data1 data2 data3
0 x0 x0 x0
1 x0 x0 x0
2 x0 x0 x0
3 x0 x0 x0
4 x0 x0 x0
5 x1 x1 x1
6 x1 x1 x1
7 x1 x1 x1
8 x1 x1 x1
9 x1 x1 x1
10 x2 x2 x2
11 x2 x2 x2
12 x2 x2 x2
13 x2 x2 x2
14 x2 x2 x2
Option 2
Similar solution with repeat + pd.DataFrame:
pd.DataFrame(np.repeat(df.values, 5, axis=0), columns=df.columns)
data1 data2 data3
0 x0 x0 x0
1 x0 x0 x0
2 x0 x0 x0
3 x0 x0 x0
4 x0 x0 x0
5 x1 x1 x1
6 x1 x1 x1
7 x1 x1 x1
8 x1 x1 x1
9 x1 x1 x1
10 x2 x2 x2
11 x2 x2 x2
12 x2 x2 x2
13 x2 x2 x2
14 x2 x2 x2
Comparisons
%timeit pd.concat([df] * 100000).sort_index().reset_index(drop=1)
1 loop, best of 3: 14.6 s per loop
%timeit df.iloc[np.repeat(np.arange(len(df)), 100000)].reset_index(drop=1)
10 loops, best of 3: 22.6 ms per loop
%timeit df.reindex(df.index.repeat(100000)).reset_index(drop=1)
10 loops, best of 3: 19.9 ms per loop
%timeit pd.DataFrame(np.repeat(df.values, 100000, axis=0), columns=df.columns)
100 loops, best of 3: 17.1 ms per loop

Rows that appear in two separate txt files, remove from one txt file

Pretty new to Python. I have one txt file ('A.txt') with a bunch of columns in it and a second txt file ('B.txt') that has different data. However, some of the data that's in B also shows up in A. Example:
A.txt:
name1 x1 y1
name2 x2 y2
name3 x3 y3
name4 x4 y4
name5 x5 y5
name6 x6 y6
...
B.txt
namea xa ya
name2 x2 y2
name3 x3 y3
nameb xb yb
namec xc yc
...
I want everything in B.txt that shows up in A.txt to be removed from A.txt.
I realize this has been asked before. I have tried the advice given to people asking similar questions, but it doesn't work for me.
So far I have:
tot = 0
with open('B.txt', 'r') as f1:
    for a in f1:
        WR = a.strip().split()
        with open('A.txt', 'r+') as f2:
            for b in f2:
                l = b.strip().split()
                if WR not in l:
                    print l
                    tot += 1
                #I've done it the following way and also doesn't give the
                #output I need
                #if WR == l: #find duplicates
                #    continue
                #else:
                #    print l
print tot
When I run this I get back what I think is the answer, but repeated 154 times (file A has 2060 lines, file B has 154).
So example of what I mean is:
A.txt:
name1 x1 y1
name4 x4 y4
name5 x5 y5
name6 x6 y6
...
name1 x1 y1
name4 x4 y4
name5 x5 y5
name6 x6 y6
...
name1 x1 y1
name4 x4 y4
name5 x5 y5
name6 x6 y6
...
name1 x1 y1
name4 x4 y4
name5 x5 y5
name6 x6 y6
...
I only want it to look like:
A.txt:
name1 x1 y1
name4 x4 y4
name5 x5 y5
name6 x6 y6
...
Like I've said, I've already looked at the other similar questions and tried what they did and it's giving me this repeating answer. Any suggestions would greatly be appreciated!
I just copied your A and B files, so check this and let me know if it's correct:
tot = 0
# Read each file once and strip the surrounding whitespace from every line.
# (strip() returns a new string, so its result has to be kept.)
with open('B.txt', 'r') as f1:
    lista = [line.strip() for line in f1]
with open('A.txt', 'r') as f2:
    listb = [line.strip() for line in f2]
# Keep only the lines of A.txt that do not appear in B.txt.
for b in listb:
    if b not in lista:
        print(b)
        tot += 1
This code prints the lines, but you can write them to a file instead if you want.
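For larger files, looking up each line in a list gets slow, and a set makes the membership test much cheaper. The following is only a sketch of that idea: the function name remove_shared_rows is made up, and it assumes two rows count as equal when their whitespace-separated fields match.

```python
def remove_shared_rows(a_lines, b_lines):
    """Return the rows of a_lines that do not occur in b_lines, in order."""
    # Compare rows by their fields so differences in spacing do not matter.
    rows_in_b = {tuple(line.split()) for line in b_lines if line.strip()}
    return [
        line.rstrip("\n")
        for line in a_lines
        if line.strip() and tuple(line.split()) not in rows_in_b
    ]


# Hypothetical in-memory stand-ins for the contents of A.txt and B.txt.
a_lines = ["name1 x1 y1\n", "name2 x2 y2\n", "name3 x3 y3\n", "name4 x4 y4\n"]
b_lines = ["namea xa ya\n", "name2 x2 y2\n", "name3 x3 y3\n"]

kept = remove_shared_rows(a_lines, b_lines)
print("\n".join(kept))
```

With real files, the two lists can be replaced by open file objects, and the result can be written out with '\n'.join(kept). Writing to a new file rather than overwriting A.txt in place is the safer choice.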

xlrd data extraction python

I am working on data extraction using xlrd and I have extracted 8 columns of inputs for my project. Each column of data has around 100 rows. My code is as follows:
wb = xlrd.open_workbook('/Users/Documents/Sample_data/AI_sample.xlsx')
sh = wb.sheet_by_name('Sample')
x1 = sh.col_values( + 0)[1:]
x2 = sh.col_values( + 1)[1:]
x3 = sh.col_values( + 2)[1:]
x4 = sh.col_values( + 3)[1:]
x5 = sh.col_values( + 4)[1:]
x6 = sh.col_values( + 5)[1:]
x7 = sh.col_values( + 6)[1:]
x8 = sh.col_values( + 7)[1:]
Now I want to create an array of inputs which gives each row of the 8 columns.
For example, if these are my 8 columns of data:
x1 x2 x3 x4 x5 x6 x7 x8
1 2 3 4 5 6 7 8
7 8 6 5 2 4 8 8
9 5 6 4 5 1 7 5
7 5 6 3 1 4 5 6
I want one array per row, something like x1, x2, x3, x4, x5, x6, x7, x8 -> [1,2,3,4,5,6,7,8], for all the 100+ rows.
I could have done a row-wise extraction, but doing that by hand for 100+ rows is impractical. So how do I do that? I also understand that it could be done using np.array, but I do not know how.
You can also try openpyxl, which is similar to xlrd:
from openpyxl import load_workbook
book = load_workbook(filename=file_name)
sheet = book['sheet name']
for row in sheet.rows:
    col_0 = row[0].value
    col_1 = row[1].value
I used to prefer openpyxl instead of xlrd
I found this piece of code very useful:
import numpy as np
X = np.array([x1, x2, x3, x4, x5, x6, x7, x8])
X = X.T  # transpose: one row per spreadsheet row, one column per input
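The same column-to-row transposition can also be done without NumPy, using the built-in zip. This is a different technique from the np.array approach above, shown as a minimal sketch with three short made-up columns standing in for the lists read with xlrd:

```python
# Hypothetical stand-ins for the column lists x1..x8 read with xlrd.
x1 = [1, 7, 9]
x2 = [2, 8, 5]
x3 = [3, 6, 6]

columns = [x1, x2, x3]

# zip(*columns) yields one tuple per spreadsheet row, taking the i-th
# value from every column; list() turns each tuple into a list.
rows = [list(values) for values in zip(*columns)]
print(rows)
```

zip stops at the shortest column, so this assumes all columns have the same length.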
