Extracting/manipulating tab-delimited data with strings and integers in Python

I have a tab-delimited file with three columns (Name Nr1 Nr2) like the following:
ABC 201 215
DEF 301 320
GHI 350 375
I would like to transform this file into the following format:
ABC 201 201 #starting from the value in the second column and counting up, line by line, to the value in the third column, like this
ABC 202 202
ABC 203 203
......and so on till the third column value
ABC 215 215
DEF 301 301
....and so on till the third column value
DEF 320 320
GHI 350 350
GHI 351 351
GHI 352 352
....
GHI 375 375
Is that possible in Python?
I would really appreciate your help with this.
Thanks in advance

Using the method here: How do I read a file line-by-line into a list?
You can take each line of the file and make it into an array.
lines = tuple(open(filename, 'r'))
As shown here: splitting a string based on tab in the file
You can then split each array value by the tab delimiter.
import re
line_array = re.split(r'\t+', lines[0])
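Putting the two pieces together, a minimal self-contained sketch (with the sample rows inlined in place of the file, so you would swap `data.splitlines()` for the opened file):

```python
# Minimal sketch: expand each (name, start, end) row into one row per value.
# `data` stands in for the contents of your tab-delimited file.
data = "ABC\t201\t215\nDEF\t301\t320\nGHI\t350\t375"

rows = []
for line in data.splitlines():   # with a real file: for line in open(filename)
    name, start, end = line.split("\t")
    for n in range(int(start), int(end) + 1):
        rows.append(f"{name}\t{n}\t{n}")

print("\n".join(rows))
```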

Related

Compare a file line by line and for those lines that meet the given requirement, print them

I have a txt file with content:
577 181 619 216
603 175 630 202
651 180 681 202
661 152 676 179
604 176 630 204
605 177 632 202
I want to read each line of this file and compare each pair of lines; if, for every position, the difference between line i and line j is <= 3, then remove one of those lines and output only one of them.
For above content I want the output as:
577 181 619 216
603 175 630 202
651 180 681 202
661 152 676 179
In this case the second line 603 175 630 202 falls under the above condition, so the other two lines (5 and 6) are removed and only line 2 is written to the output, as given above.
f1 = open("result.txt", "r")
f2 = open("final.txt", "w")
for line1 in f1:
    for line2 in f1:
        if each number line2 - line1 <= 3:   # pseudocode
            # remove one of those lines and write the remaining line to the new file
            # f2.write(lines)
f1.close()
f2.close()
For example, look at lines 2, 5, and 6: each corresponding number differs by at most 3. For lines 2 and 5, the first elements are 603 and 604 (|603 - 604| = 1, i.e. less than 3), the second elements give |175 - 176| = 1, the third |630 - 630| = 0, and the fourth |202 - 204| = 2. All of this falls under the given condition, so for these lines only one line is needed.
For starters, you need to convert the lines into numbers. Then compute the absolute differences between pairs of lines, and keep a line only if it is not a near-duplicate of an earlier kept line. Note that you cannot nest two loops over the same file handle; read the lines into a list first:
lines = f1.readlines()
keep = [True] * len(lines)
for i, line1 in enumerate(lines):
    if not keep[i]:
        continue
    nums1 = list(map(int, line1.split()))
    for j in range(i + 1, len(lines)):   # skip repeated + equal pairs
        nums2 = list(map(int, lines[j].split()))
        diffs = [abs(nums2[x] - nums1[x]) for x in range(len(nums1))]
        print(f'Diff between {nums1} and {nums2}: {diffs}')  # for debugging
        if all(d <= 3 for d in diffs):
            keep[j] = False   # near-duplicate; drop the later line
for line, k in zip(lines, keep):
    if k:
        f2.write(line)
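As a self-contained sketch with the sample rows from the question inlined (swap the list for `f1.readlines()` when reading from result.txt), assuming a later line is dropped whenever every position differs by at most 3 from an earlier kept line:

```python
lines = [
    "577 181 619 216",
    "603 175 630 202",
    "651 180 681 202",
    "661 152 676 179",
    "604 176 630 204",
    "605 177 632 202",
]

keep = [True] * len(lines)
for i in range(len(lines)):
    if not keep[i]:
        continue
    nums1 = [int(x) for x in lines[i].split()]
    for j in range(i + 1, len(lines)):
        nums2 = [int(x) for x in lines[j].split()]
        if all(abs(a - b) <= 3 for a, b in zip(nums1, nums2)):
            keep[j] = False   # near-duplicate of an earlier kept line

result = [line for line, k in zip(lines, keep) if k]
print("\n".join(result))
```

This keeps the first four lines and drops lines 5 and 6, matching the expected output in the question.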

Breaking from parts of a loop in python, but continuing with the rest

Assume we have a text file named MyFile.txt, which is structured like this:
some info
some more info
Unit: Unit1
E 32 5 5 ee
R 123 534 345 543 634 634 345
R 543 634 634 345 123 534 345
We want to store data from the lines that start with R in a list called Data (which will later be converted to a dataframe), and we want to add the "Unit name" (Unit1 here) to the end of each line, so that the end result would look like this:
print(Data)
R 123 534 345 543 634 634 345 Unit1
R 543 634 634 345 123 534 345 Unit1
The following function fn will loop through each line, store the lines that start with R in Data, store the unit name (Unit1 here) in a list called UN, and append it to the end of each line:
UN = []
Data = []

def fn(FileName):
    with open(FileName, "r") as fi:
        for line in fi:
            if line.startswith("Unit"):
                UN.append(line.split()[1])
            elif line.startswith("R"):
                Data.append(line.split()[0:] + list(UN))
In the event that our text file has two lines that start with Unit:
some info
some more info
Unit: Unit1
E 32 5 5 ee
Unit: Unit1
R 123 534 345 543 634 634 345
R 543 634 634 345 123 534 345
The function above would append Unit1 to the end of each line twice, resulting in this:
R 123 534 345 543 634 634 345 Unit1 Unit1
R 543 634 634 345 123 534 345 Unit1 Unit1
How can we stop the loop after it finds the unit name one time, but continue with the rest of the loop so that it only appends the unit name once?
Is there a specific reason you are using a list to store the unit? Would both units be the same? In that case, instead of storing the unit in a list you can simply store it in a variable; if your loop comes across it a second time it just overwrites the variable, so you still append only one unit.
Data = []

def fn(FileName):
    UN = ''   # local variable; a module-level UN would be shadowed by the assignment below
    with open(FileName, "r") as fi:
        for line in fi:
            if line.startswith("Unit"):
                UN = line.split()[1]
            elif line.startswith("R"):
                Data.append(line.split() + [UN])
If you want to make sure you only pick the first entry of the unit, you can add an additional if statement that checks whether UN is empty.
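Sketched on an in-memory list of lines (a stand-in for the file, with a hypothetical second Unit2 line added to show it is ignored), that first-entry check would look like this:

```python
lines = [
    "Unit: Unit1",
    "R 123 534",
    "Unit: Unit2",   # ignored: UN is already set
    "R 543 634",
]

Data = []
UN = ''
for line in lines:
    if line.startswith("Unit") and not UN:   # only the first Unit line is kept
        UN = line.split()[1]
    elif line.startswith("R"):
        Data.append(line.split() + [UN])
```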

How to merge csv file with different column numbers using Pandas

I want to merge multiple files with different numbers of columns. They seem to merge, but I'm not able to keep the column names.
import glob
import pandas as pd

indir = "/home/centos/Data/MERGED/"
fileList = glob.glob(indir + "*.tsv")
dd = [pd.read_csv(f, sep="\t", header=0) for f in fileList]
result = pd.concat(dd, axis=1, join='inner', ignore_index=True, sort=False)
column_file = []
for f in fileList:
    tp = pd.read_csv(f, sep="\t", header=0)
    print(len(tp.columns.tolist()))
    column_file.append(",".join(tp.columns.tolist()))
344
119
177
304
502
178
36
80
478
375
502
166
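A likely culprit, sketched on two small in-memory frames: with axis=1, ignore_index=True replaces the column labels with 0..n-1, which is why the names disappear. Leaving it out keeps each file's original column names.

```python
import pandas as pd

# Stand-ins for two files with different numbers of columns
a = pd.DataFrame({"x": [1, 2], "y": [3, 4]})
b = pd.DataFrame({"z": [5, 6]})

# No ignore_index=True: the original column names survive the concat
result = pd.concat([a, b], axis=1, join='inner', sort=False)
print(result.columns.tolist())
```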

find value in column and based on it create a new dataframe in pandas

I have a variable in the following format fg = 2017-20. It's a string. And also I have a dataframe:
flag №
2017-18 389
2017-19 390
2017-20 391
2017-21 392
2017-22 393
2017-23 394
...
I need to find this value (fg) in the column "flag" and select the corresponding value in the column "№" (391 in this example). Then create a new dataframe with a column "№_new" and fill it with 53 consecutive values starting from that one. The result should look like this:
№_new
391
392
393
394
395
...
442
443
444
It does not look difficult, but I cannot find anything suitable in other questions. Can someone advise anything, please?
You need boolean indexing with loc for filtering, then convert the one-item Series to a scalar by converting it to a NumPy array with values and selecting the first element with [0].
Last, create the new DataFrame with numpy.arange.
import numpy as np
import pandas as pd

fg = '2017-20'
val = df.loc[df['flag'] == fg, '№'].values[0]
print(val)
391
df1 = pd.DataFrame({'№_new': np.arange(val, val + 53)})
print(df1)
№_new
0 391
1 392
2 393
3 394
4 395
5 396
6 397
7 398
8 399
9 400
10 401
11 402
..
..

"ValueError: labels ['timestamp'] not contained in axis" error

I have this code; I want to remove the column 'timestamp' from the file u.data, but I can't. It shows the error
"ValueError: labels ['timestamp'] not contained in axis"
How can I correct it?
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.rc("font", size=14)
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.cross_validation import KFold
from sklearn.cross_validation import train_test_split
data = pd.read_table('u.data')
data.columns=['userID', 'itemID','rating', 'timestamp']
data.drop('timestamp', axis=1)
N = len(data)
print data.shape
print list(data.columns)
print data.head(10)
One of the biggest problems, and one that often goes unnoticed, is that when inserting headers into the u.data file, the separator between headers must be exactly the same as the separator between the items of a data row. For example, if a tab separates the values in a row, you should not use spaces. In your u.data file, add headers separated by exactly the same whitespace as is used between the items of a row.
PS: Use Sublime Text; Notepad/Notepad++ sometimes does not handle this correctly.
"ValueError: labels ['timestamp'] not contained in axis"
You don't have headers in the file, so the way you loaded it you got a df whose column names are the first row of the data. You then tried to access the column timestamp, which doesn't exist.
Your u.data doesn't have headers in it
$head u.data
196 242 3 881250949
186 302 3 891717742
So working with column names isn't going to be possible unless you add the headers. You can add the headers to the file u.data, e.g. I opened it in a text editor and added the line a b c timestamp at the top (this appears to be a tab-separated file, so be careful not to use spaces when adding the header, or it breaks the format)
$head u.data
a b c timestamp
196 242 3 881250949
186 302 3 891717742
Now your code works and data.columns returns
Index([u'a', u'b', u'c', u'timestamp'], dtype='object')
And the rest of the trace of your working code is now
(100000, 4) # the shape
['a', 'b', 'c', 'timestamp'] # the columns
a b c timestamp # the df
0 196 242 3 881250949
1 186 302 3 891717742
2 22 377 1 878887116
3 244 51 2 880606923
4 166 346 1 886397596
5 298 474 4 884182806
6 115 265 2 881171488
7 253 465 5 891628467
8 305 451 3 886324817
9 6 86 3 883603013
If you don't want to add headers
Or you can drop the column 'timestamp' by position (presumably index 3). With df.iloc we select all rows and the columns at positions 0 through 2 (the end of the slice is exclusive), thereby dropping the column at index 3:
data.iloc[:, 0:3]
I would do it this way:
data = pd.read_table('u.data', header=None,
                     names=['userID', 'itemID', 'rating', 'timestamp'],
                     usecols=['userID', 'itemID', 'rating'])
Check:
In [589]: data.head()
Out[589]:
userID itemID rating
0 196 242 3
1 186 302 3
2 22 377 1
3 244 51 2
4 166 346 1
