Extracting a column from a text file containing headers and delimiters - python

I have a text file which looks like this:
~Date and Time of Data Converting: 15.02.2019 16:12:44
~Name of Test: XXX
~Address: ZZZ
~ID: OPP
~Testchannel: CH06
~a;b;DateTime;c;d;e;f;g;h;i;j;k;extract;l;m;n;o;p;q;r
0;1;04.03.2019 07:54:19;0;0;2;Pause;3,57263521596443;0;0;0;0;24,55957;1;3;0;0;0;0;0
5,5523894132E-7;2;04.03.2019 07:54:19;5,5523894132E-7;5,5523894132E-7;2;Pause;3,57263521596443;0;0;0;0;24,55957;1;0;0;0;0;0;0
0,00277777777779538;3;04.03.2019 07:54:29;0,00277777777779538;0,00277777777779538;2;Pause;3,5724446855812;0;0;0;0;24,55653;1;1;0;0;0;0;0
0,00555555532278617;4;04.03.2019 07:54:39;0,00555555532278617;0,00555555532278617;2;Pause;3,57263521596443;0;0;0;0;24,55957;1;1;0;0;0;0;0
0,00833333333338613;5;04.03.2019 07:54:49;0,00833333333338613;0,00833333333338613;2;Pause;3,57263521596443;0;0;0;0;24,55653;1;1;0;0;0;0;0
0,0111112040002119;6;04.03.2019 07:54:59;0,0111112040002119;0,0111112040002119;2;Pause;3,57263521596443;0;0;0;0;24,55653;1;1;0;0;0;0;0
0,013888887724954;7;04.03.2019 07:55:09;0,013888887724954;0,013888887724954;2;Pause;3,57263521596443;0;0;0;0;24,55653;1;1;0;0;0;0;0
I need to extract the values from the column named extract and store the output as an Excel file.
Can anyone give me an idea of how to proceed?
So far, I have only been able to create an empty Excel file for the output, and I have read the text file. However, I don't know how to append the output to the empty Excel file.
import os
import numpy as np

file = open('extract.csv', "a")
if os.path.getsize('extract.csv') == 0:
    file.write(" " + ";" + "Datum" + ";" + "extract" + ";")
with open('myfile.txt') as f:
    dat = [f.readline() for x in range(10)]
datum = dat[7].split(' ')[3]
data = np.genfromtxt('myfile.txt', delimiter=';', skip_header=12, dtype=str)

You can use the pandas module.
You need to skip the first lines of your text file. Here, I assume the number of such lines is unknown, so I loop until I find a data row.
Then read the data.
Finally, export it as a dataframe with to_excel (doc).
Here is the code:
# Import module
import pandas as pd

# Read file
with open('temp.txt') as f:
    content = f.read().split("\n")

# Skip the first lines (find where the data starts)
for i, line in enumerate(content):
    if line and line[0] != '~':
        break

# Column names and data
header = content[i - 1][1:].split(';')  # [1:] drops the leading '~'
data = [row.split(';') for row in content[i:]]

# Store in dataframe
df = pd.DataFrame(data, columns=header)
print(df)
# a b DateTime c d e f ... l m n o p q r
# 0 0 1 04.03.2019 07:54:19 0 0 2 Pause ... 1 3 0 0 0 0 0
# 1 5,5523894132E-7 2 04.03.2019 07:54:19 5,5523894132E-7 5,5523894132E-7 2 Pause ... 1 0 0 0 0 0 0
# 2 0,00277777777779538 3 04.03.2019 07:54:29 0,00277777777779538 0,00277777777779538 2 Pause ... 1 1 0 0 0 0 0
# 3 0,00555555532278617 4 04.03.2019 07:54:39 0,00555555532278617 0,00555555532278617 2 Pause ... 1 1 0 0 0 0 0
# 4 0,00833333333338613 5 04.03.2019 07:54:49 0,00833333333338613 0,00833333333338613 2 Pause ... 1 1 0 0 0 0 0
# 5 0,0111112040002119 6 04.03.2019 07:54:59 0,0111112040002119 0,0111112040002119 2 Pause ... 1 1 0 0 0 0 0
# 6 0,013888887724954 7 04.03.2019 07:55:09 0,013888887724954 0,013888887724954 2 Pause ... 1 1 0 0 0 0 0
# Select only the extract column
# df = df['extract']
# Save the data in an excel file
df.to_excel("OutPut.xlsx", "MySheetName", index=False)
Note: if you know the number of lines to skip, you can simply load the dataframe with read_csv using the skiprows parameter. (doc).
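For instance, a minimal sketch assuming the five '~' metadata lines shown above precede the header row (adjust skiprows to your file); decimal=',' is an added assumption to convert the comma decimal separators:
import pandas as pd

df = pd.read_csv('temp.txt', sep=';', skiprows=5, decimal=',')
df.columns = [c.lstrip('~') for c in df.columns]  # the header row starts with '~', so clean the first column name
df[['DateTime', 'extract']].to_excel('OutPut.xlsx', 'MySheetName', index=False)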
Hope that helps!


Rewrite for loop as while loop

I am learning Python.
For the code below, how can I convert the for loop to a while loop efficiently?
import pandas as pd

transactions01 = []
file = open('raw-data1.txt', 'w')
file.write('HotDogs,Buns\nHotDogs,Buns\nHotDogs,Coke,Chips\nChips,Coke\nChips,Ketchup\nHotDogs,Coke,Chips\n')
file.close()
file = open('raw-data1.txt', 'r')
lines = file.readlines()
for line in lines:
    items = line[:-1].split(',')
    has_item = {}
    for item in items:
        has_item[item] = 1
    transactions01.append(has_item)
file.close()
data = pd.DataFrame(transactions01)
data.fillna(0, inplace=True)
data
Code:
i = 0
while i < len(lines):
    items = lines[i][:-1].split(',')
    has_item = {}
    j = 0
    while j < len(items):
        has_item[items[j]] = 1
        j += 1
    transactions01.append(has_item)
    i += 1
It looks like you could use the csv module to parse the file, since you've got an inconsistent number of items per row, then turn it into a dataframe, use pd.get_dummies to get 0/1's per item present, and aggregate back to row level to produce your final output, e.g.:
import pandas as pd
import csv

with open('raw-data1.txt') as fin:
    df = pd.get_dummies(pd.DataFrame(csv.reader(fin)).stack()).groupby(level=0).max()
Will give you a df of:
Buns Chips Coke HotDogs Ketchup
0 1 0 0 1 0
1 1 0 0 1 0
2 0 1 1 1 0
3 0 1 1 0 0
4 0 1 0 0 1
5 0 1 1 1 0
.. which you can then write back out as CSV if required.
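For example, with an output name of your choosing:
df.to_csv('transactions.csv', index=False)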

python: combine multiple files into a matrix with 1 and 0

There are several files like this:
sample_a.txt containing:
a
b
c
sample_b.txt containing:
b
w
e
sample_c.txt containing:
a
m
n
I want to make a matrix of absence/presence like this:
a b c w e m n
sample_a 1 1 1 0 0 0 0
sample_b 0 1 0 1 1 0 0
sample_c 1 0 0 0 0 1 1
I know a dirty and dumb way to solve it: make a list of all possible letters in those files, then iteratively compare each line of each file against this 'library' and fill in the final matrix by index. But I guess there's a smarter solution. Any ideas?
Update:
the sample files can be of different lengths.
You can try:
import pandas as pd
from collections import defaultdict

dd = defaultdict(list)  # dictionary where each value per key is a list
files = ["sample_a.txt", "sample_b.txt", "sample_c.txt"]
for file in files:
    with open(file, "r") as f:
        for row in f:
            dd[file.split(".")[0]].append(row[0])
            # appending to dictionary dd:
            # KEY: file.split(".")[0] is the file name without extension
            # VALUE: row[0] is the first character of the line in the text file
            # (the second character is the newline '\n', so I removed it)
df = pd.DataFrame.from_dict(dd, orient='index').T.melt()  # convert the dictionary to a long-format dataframe
pd.crosstab(df.variable, df.value)  # make a crosstab, similar to pd.pivot_table
result:
value a b c e f m n o p w
variable
sample_a 1 1 1 0 0 0 0 0 0 0
sample_b 0 1 0 1 1 0 0 0 0 1
sample_c 1 0 0 0 0 1 1 1 1 0
Please note letters (columns) are in alphabetical order.
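An alternative sketch with the same shape of result, using explode and get_dummies instead of a defaultdict (same file names assumed):
import pandas as pd

files = ["sample_a.txt", "sample_b.txt", "sample_c.txt"]
# one row per (sample, letter) pair, indexed by file name without extension
s = pd.Series({f.split(".")[0]: open(f).read().split() for f in files}).explode()
# dummy-encode the letters, then collapse back to one row per sample
matrix = pd.get_dummies(s).groupby(level=0).max()
print(matrix)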

Python read from file that has multiple values

Ben
5 0 0 0 0 0 0 1 0 1 -3 5 0 0 0 5 5 0 0 0 0 5 0 0 0 0 0 0 0 0 1 3 0 1 0 -5 0 0 5 5 0 5 5 5 0 5 5 0 0 0 5 5 5 5 -5
Moose
5 5 0 0 0 0 3 0 0 1 0 5 3 0 5 0 3 3 5 0 0 0 0 0 5 0 0 0 0 0 3 5 0 0 0 0 0 5 -3 0 0 0 5 0 0 0 0 0 0 5 5 0 3 0 0
Reuven
I was wondering how to read multiple lines of this sort of file into a list or dictionary, so that the ratings (the numbers) stay associated with the name of the person they correspond to.
You could read the file two lines at a time and populate a dictionary.
path = ...  # path to your file
out = {}
with open(path) as f:
    # iterate over lines in the file
    for line in f:
        # the 1st, 3rd, ... line contains the name
        name = line
        # the 2nd, 4th, ... line contains the ratings
        ratings = next(f)  # calling next here advances the iterator, so we consume two lines per iteration
        # write values to the dictionary, using strip to get rid of whitespace
        out[name.strip()] = [int(rating.strip()) for rating in ratings.strip().split(' ')]
It could also be done with a while loop:
path = ...  # path to your file
out = {}
with open(path) as f:
    while True:
        # read name and ratings, which are on consecutive lines
        name = f.readline()
        ratings = f.readline()
        # stop condition: end of file reached
        if name == '':
            break
        # write values to the dictionary:
        # use the name as key and convert the ratings to integers,
        # using strip to get rid of whitespace
        out[name.strip()] = [int(rating.strip()) for rating in ratings.strip().split(' ')]
You can use zip to combine the lines in pairs to form the dictionary:
with open("file.txt", "r") as f:
    lines = f.read().split("\n")
d = {n: [*map(int, r.split())] for n, r in zip(lines[::2], lines[1::2])}
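A quick usage check, with values from the sample data above:
print(d['Ben'][:5])  # first five of Ben's ratings: [5, 0, 0, 0, 0]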

Python How to count a series with multiple items in one line

import numpy
import pandas as pd
from numpy import loadtxt
from itertools import chain

f = open("routeviews-rv2-20181110-1200.pfx2as", 'r')
# read file into array, ignore first 6 lines
lines = loadtxt("routeviews-rv2-20181110-1200.pfx2as", dtype='str',
                delimiter="\t", unpack=False)
# convert to dataframe
df = pd.DataFrame(lines, columns=['IPPrefix', 'PrefixLength', 'AS'])
series = df['AS'].astype(str).str.replace('_', ',').str.split(',')
arr = numpy.array(list(chain.from_iterable(series)))
ASes = pd.Series(numpy.bincount(arr))
ValueError: invalid literal for int() with base 10: '31133_65500,65501'
I want to count how many times each item appears in the AS column. However, some lines have multiple entries that need to be counted.
Refer to: Python Find max in dataframe column to loop to find all values
Txt file: http://data.caida.org/datasets/routing/routeviews-prefix2as/2018/11/
But that cannot count line 67820 below.
Out[94]: df=
A B C
0 1.0.0.0 24 13335
1 1.0.4.0 22 56203
2 1.0.4.0 24 56203
3 1.0.5.0 24 56203
... ... ...
67820 1.173.142.0 24 31133_65500,65501
... ... ...
778719 223.255.252.0 24 58519
778720 223.255.254.0 24 55415
The _ is not a typo, that is how it appears in the file.
Desired output.
1335 1
... ..
31133 1
... ..
55415 1
... ..
56203 3
... ..
58159 1
... ..
65500 1
65501 1
... ..
replace + split + chain
You can replace '_' with ',', then split and chain before using np.bincount:
import numpy as np
import pandas as pd
from itertools import chain

series = df['A'].astype(str).str.replace('_', ',').str.split(',')
arr = np.array(list(chain.from_iterable(series))).astype(int)
print(pd.Series(np.bincount(arr)))
0 0
1 0
2 2
3 4
4 1
5 6
6 1
7 0
8 0
9 0
10 1
dtype: int64
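If you only want counts for the AS numbers that actually occur (bincount reports every integer up to the maximum), a value_counts sketch under the same column assumptions:
counts = df['A'].astype(str).str.replace('_', ',').str.split(',').explode().astype(int).value_counts().sort_index()
print(counts)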

Python from matrix to simple text file

I would like to ask how to transform a file that looks like this:
123 111 1
146 204 2
178 398 1
...
...
The first column is x, the second is y, and the third is the number in each square.
My matrix is 400x400. I would like to change it to a simple file.
My file doesn't contain every square (for example, 0 0 doesn't exist), which means that in the output file I would like to have 0 in the first place of the first row.
My output file should look like this
0 0 1 0 0 0 1 0 7 9 3 0 2 0 ...
8 0 0 1 0 0 0 0 0 0 0 0 0 0 ...
7 8 9 0 7 5 0 0 3 2 4 5 5 7 ...
...
...
How can I change my file?
From the first file I would like to reach the second: a text file with 400 lines, each containing 400 values separated by a space.
Just initialize your matrix as a list of lists of zeros, then iterate over the lines in the file and set the values in the matrix accordingly. Cells that are not in the file will remain unchanged.
matrix = [[0 for i in range(400)] for k in range(400)]
with open("filename") as data:
    for row in data:
        (x, y, n) = map(int, row.split())
        matrix[x][y] = n
Finally, write that matrix to another file:
with open("outfile", "w") as outfile:
for row in matrix:
outfile.write(" ".join(map(str, row)) + "\n")
You could also use numpy (note the shape should match your 400x400 grid):
import numpy
matrix = numpy.zeros((400, 400), dtype=numpy.int8)
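A fuller numpy sketch under the same assumptions ("filename" and "outfile" are placeholders):
import numpy as np

matrix = np.zeros((400, 400), dtype=int)
data = np.loadtxt("filename", dtype=int)     # columns: x, y, n
matrix[data[:, 0], data[:, 1]] = data[:, 2]  # fancy indexing fills every listed square at once
np.savetxt("outfile", matrix, fmt="%d", delimiter=" ")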
