How to read in csv with no specific delimiter? - python

I have a problem. I have a csv file which has no "," as delimiter but is built as a common excel file.
# 2016-01-01: Prices/Volumes for Market
23-24 24,57
22-23 30,1
21-22 29,52
20-21 33,07
19-20 35,34
18-19 37,41
I am only interested in reading the second column, e.g. 24,57 in the first line. The data has no header. How should I proceed here?
pd.read_csv(f,usecols = [2])
Does not work because I think there is no column identified. Thanks for your help!

A CSV reader may not be the right tool here.
Try a regular expression and process the file line by line:
https://docs.python.org/2/library/re.html
For example:
>>> import re
>>> # \d+ rather than \d{2} for the value: rows such as "30,1" have a single digit after the comma
>>> re.search(r'(\d{2})-(\d{2}) (\d+),(\d+)', "23-24 24,57").group(1)
'23'
>>> re.search(r'(\d{2})-(\d{2}) (\d+),(\d+)', "23-24 24,57").group(2)
'24'
>>> re.search(r'(\d{2})-(\d{2}) (\d+),(\d+)', "23-24 24,57").group(3)
'24'
>>> re.search(r'(\d{2})-(\d{2}) (\d+),(\d+)', "23-24 24,57").group(4)
'57'
To read a file line by line in Python, see:
How to read large file, line by line in python
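The line-by-line approach can be sketched as follows. This is only a minimal sketch, assuming Python 3 and the layout shown in the question (a `#` comment line followed by `HH-HH value` rows with a decimal comma). The function name `read_second_column` is made up for illustration.

```python
def read_second_column(lines):
    """Return the second whitespace-separated field of each row as a float."""
    values = []
    for line in lines:
        line = line.strip()
        # Skip blank lines and the '#' comment line at the top of the file.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        # Convert the decimal comma ("24,57") to a decimal point before parsing.
        values.append(float(fields[1].replace(",", ".")))
    return values


sample = """# 2016-01-01: Prices/Volumes for Market
23-24 24,57
22-23 30,1
21-22 29,52
"""
print(read_second_column(sample.splitlines()))
```

For a real file, an open file object can be passed instead of the list of lines, since iterating over it also yields one line at a time.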

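Try this:
pd.read_csv(f, delim_whitespace=True, names=['desired_col_name'], usecols=[1])
In recent pandas versions delim_whitespace is deprecated; use sep=r'\s+' instead. If the file really starts with the # line shown in the question, adding comment='#' should skip it, and decimal=',' parses values like 24,57 as floats.
Alternatively, you might want to use pd.read_fwf.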

Related

searching a specific values from one file in another file using nested for loop [duplicate]

This question already has an answer here:
Script skips second for loop when reading a file
(1 answer)
Closed 2 years ago.
I have two files: A.txt has hundreds of rows in the format (ip,mac), and B.txt has hundreds of rows in the format (mac). I want to search for each MAC from file B in file A and, if it is found, print the matching line (ip,mac) from file A. There are more than 100 MAC matches between the two files, but the code I wrote returns only the first match.
Below is my simple code:
with open("B.txt", "r") as out_mac_file, open("A.txt", "r") as out_arp_file:
    for x in out_mac_file:
        for y in out_arp_file:
            if x in y:
                print(y)
Any idea what could be wrong in the code, or whether there are other ways to do this?
Edit: Adding the format of file A and file B
File B
64167f18cd3d
64167f18c77a
64167f067082
64167f0670b5
64167f067400
64167f0674e5
64167f06740d
File A
10.55.14.160,64167f869f18
10.55.20.59,64167f37a5f4
10.55.20.62,64167f8866e0
10.55.20.65,64167f8b4bd8
10.55.20.66,64167f372a72
10.55.20.67,64167f371436
If you are OK with using pandas (since your data is in comma-separated format):
import pandas as pd
# A.txt holds "ip,mac" rows; B.txt holds one MAC per line.
# dtype=str keeps MACs as text even if one happens to look like a number.
a = pd.read_csv("A.txt", header=None, names=["ip", "mac"], dtype=str)
b = pd.read_csv("B.txt", header=None, names=["mac"], dtype=str)
for mac in b["mac"]:
    result = a[a["mac"] == mac]
    if len(result) > 0:
        print(result)
Or just a one-liner instead of a loop:
a.merge(b, on="mac")
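A likely cause of the original behaviour: a file object is an iterator, so the inner `for y in out_arp_file` consumes A.txt completely during the first pass of the outer loop, and every later pass finds nothing left to read. Each `x` also still carries its trailing newline. A pandas-free alternative is to read the MACs into a set and scan A.txt once. The following is a minimal sketch, assuming Python 3 and the file formats shown in the question. The function name `find_matches` and the demo rows are made up for illustration.

```python
import os
import tempfile


def find_matches(arp_path, mac_path):
    """Return the "ip,mac" lines of arp_path whose MAC is listed in mac_path."""
    with open(mac_path) as mac_file:
        # strip() removes the trailing newline so comparisons are exact.
        wanted = {line.strip() for line in mac_file if line.strip()}

    matches = []
    with open(arp_path) as arp_file:
        for line in arp_file:
            line = line.strip()
            if not line:
                continue
            # Each row is "ip,mac"; take the part after the last comma.
            mac = line.rpartition(",")[2]
            if mac in wanted:
                matches.append(line)
    return matches


# Demo with two small temporary files in the question's formats.
with tempfile.TemporaryDirectory() as tmp:
    a_path = os.path.join(tmp, "A.txt")
    b_path = os.path.join(tmp, "B.txt")
    with open(a_path, "w") as f:
        f.write("10.55.14.160,64167f869f18\n10.55.20.59,64167f37a5f4\n")
    with open(b_path, "w") as f:
        f.write("64167f37a5f4\n64167f18cd3d\n")
    demo_matches = find_matches(a_path, b_path)

print(demo_matches)
```

The set lookup means each file is read only once, instead of re-reading A.txt for every MAC in B.txt.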

How to reconstruct and change structure of a dataset using python?

I have a dataset and I need to reconstruct some data from this dataset to a new style
My dataset is something like below (Stored in a file named train1.txt):
2342728, 2414939, 2397722, 2386848, 2398737, 2367906, 2384003, 2399896, 2359702, 2414293, 2411228, 2416802, 2322710, 2387437, 2397274, 2344681, 2396522, 2386676, 2413824, 2328225, 2413833, 2335374, 2328594, 497966, 2384001, 2372746, 2386538, 2348518, 2380037, 2374364, 2352054, 2377990, 2367915, 2412520, 2348070, 2356469, 2353541, 2413446, 2391930, 2366968, 2364762, 2347618, 2396550, 2370538, 2393212, 2364244, 2387901, 4752, 2343855, 2331890, 2341328, 2413686, 2359209, 2342027, 2414843, 2378401, 2367772, 2357576, 2416791, 2398673, 2415237, 2383922, 2371110, 2365017, 2406357, 2383444, 2385709, 2392694, 2378109, 2394742, 2318516, 2354062, 2380081, 2395546, 2328407, 2396727, 2316901, 2400923, 2360206, 971, 2350695, 2341332, 2357275, 2369945, 2325241, 2408952, 2322395, 2415137, 2372785, 2382132, 2323580, 2368945, 2413009, 2348581, 2365287, 2408766, 2382349, 2355549, 2406839, 2374616, 2344619, 2362449, 2380907, 2327352, 2347183, 2384375, 2368019, 2365927, 2370027, 2343649, 2415694, 2335035, 2389182, 2354073, 2363977, 2346358, 2373500, 2411328, 2348913, 2372324, 2368727, 2323717, 2409571, 2403981, 2353188, 2343362, 285721, 2376836, 2368107, 2404464, 2417233, 2382750, 2366329, 675, 2360991, 2341475, 2346242, 2391969, 2345287, 2321367, 2416019, 2343732, 2384793, 2347111, 2332212, 138, 2342178, 2405886, 2372686, 2365963, 2342468
I need to convert to below style (I need to store in a new file as train.txt):
2342728
2414939
2397722
2386848
2398737
2367906
2384003
2399896
2359702
2414293
And other numbers ….
My python version is 2.7.13
My operating system is Ubuntu 14.04 LTS
I would appreciate any help.
Thank you so much.
I would suggest using regex (regular expressions). This might be a little overkill, but in the long run, knowing regex is super powerful.
import re
def return_no_commas(string):
    # \d+ matches one or more digits; \d* would also return empty matches
    regex = r'\d+'
    matches = re.findall(regex, string)
    for match in matches:
        print(match)
numbers = """
2342728, 2414939, 2397722, 2386848, 2398737, 2367906, 2384003, 2399896, 2359702, 2414293, 2411228, 2416802, 2322710, 2387437, 2397274, 2344681, 2396522, 2386676, 2413824, 2328225, 2413833, 2335374, 2328594, 497966, 2384001, 2372746, 2386538, 2348518, 2380037, 2374364, 2352054, 2377990, 2367915, 2412520, 2348070, 2356469, 2353541, 2413446, 2391930, 2366968, 2364762, 2347618, 2396550, 2370538, 2393212, 2364244, 2387901, 4752, 2343855, 2331890, 2341328, 2413686, 2359209, 2342027, 2414843, 2378401, 2367772, 2357576, 2416791, 2398673, 2415237, 2383922, 2371110, 2365017, 2406357, 2383444, 2385709, 2392694, 2378109, 2394742, 2318516, 2354062, 2380081, 2395546, 2328407, 2396727, 2316901, 2400923, 2360206, 971, 2350695, 2341332, 2357275, 2369945, 2325241, 2408952, 2322395, 2415137, 2372785, 2382132, 2323580, 2368945, 2413009, 2348581, 2365287, 2408766, 2382349, 2355549, 2406839, 2374616, 2344619, 2362449, 2380907, 2327352, 2347183, 2384375, 2368019, 2365927, 2370027, 2343649, 2415694, 2335035, 2389182, 2354073, 2363977, 2346358, 2373500, 2411328, 2348913, 2372324, 2368727, 2323717, 2409571, 2403981, 2353188, 2343362, 285721, 2376836, 2368107, 2404464, 2417233, 2382750, 2366329, 675, 2360991, 2341475, 2346242, 2391969, 2345287, 2321367, 2416019, 2343732, 2384793, 2347111, 2332212, 138, 2342178, 2405886, 2372686, 2365963, 2342468
"""
return_no_commas(numbers)
Let me explain what everything does.
import re
just imports the regular expression module. The regular expression to use is
regex = r'\d+'
The "r" at the beginning makes it a raw string, so the backslash reaches the regex engine unchanged. The pattern looks for any digit (the "\d" part) repeated one or more times (the "+" part). With "*" instead of "+", the pattern would also match the empty string between numbers and print blank lines. Then we print out all the matches.
I saved your numbers in a string called numbers, but you could just as easily read in a file and work with its contents.
You'll get something like:
2342728
2414939
2397722
2386848
2398737
2367906
2384003
2399896
2359702
2414293
2411228
2416802
2322710
2387437
2397274
2344681
2396522
2386676
2413824
2328225
2413833
2335374
2328594
497966
2384001
2372746
2386538
2348518
2380037
2374364
2352054
2377990
2367915
2412520
2348070
2356469
2353541
2413446
2391930
2366968
2364762
2347618
2396550
2370538
2393212
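On the regex answer above: the choice of quantifier matters, and the difference between `\d*` and `\d+` is easy to check on a short string. A small sketch, assuming Python 3 and a made-up three-number sample:

```python
import re

text = "2342728, 2414939, 971"

# "*" allows zero digits, so empty matches show up between the numbers.
loose = re.findall(r"\d*", text)

# "+" requires at least one digit, so only the numbers themselves match.
strict = re.findall(r"\d+", text)

print(loose)
print(strict)
```

Filtering the empty strings out of `loose` gives the same list as `strict`, so `\d+` is simply the more direct way to express it.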
It sounds to me like your original data is separated by commas. However, you want the data separated by new-line characters (\n) instead. This is very easy to do.
def convert_comma_to_newline(rfilename, wfilename):
    """
    rfilename -- name of file to read-from
    wfilename -- name of file to write-to
    """
    assert(rfilename != wfilename)
    # open two files, one in read-mode
    # the other in write-mode
    rfile = open(rfilename, "r")
    wfile = open(wfilename, "w")
    # read the file into a string
    rstryng = rfile.read()
    lyst = rstryng.split(",")
    # EXAMPLE:
    # rstryng == "1,2,3,4"
    # lyst == ["1", "2", "3", "4"]
    # remove leading and trailing whitespace
    lyst = [s.strip() for s in lyst]
    wstryng = "\n".join(lyst)
    # write() takes a single string; writelines() is meant for a list of lines
    wfile.write(wstryng)
    rfile.close()
    wfile.close()
    return
convert_comma_to_newline("train1.txt", "train.txt")
# open and check the contents of `train.txt`
Since others have added answers, I will include one using numpy.
If you are OK using numpy, it is as simple as:
import numpy as np
data = np.genfromtxt('train1.txt', dtype=int, delimiter=',')
If you want a list instead of a numpy array:
data.tolist()
[2342728,
2414939,
2397722,
2386848,
2398737,
2367906,
2384003,
2399896,
....
]

Regex remove certain characters from a file

I'd like to write a python script that reads a text file containing this:
FRAME
1 J=1,8 SEC=CL1 NSEG=2 ANG=0
2 J=8,15 SEC=CL2 NSEG=2 ANG=0
3 J=15,22 SEC=CL3 NSEG=2 ANG=0
And output a text file that looks like this:
1 1 8
2 8 15
3 15 22
I essentially don't need the commas or the SEC, NSEG and ANG data. Could someone help me use regex to do this?
So far I have this:
import re
r = re.compile(r"\s*(\d+)\s+J=(\S+)\s+SEC=(\S+)\s+NSEG=(\S+)+ANG=(\S+)\s")
with open('RawDataFile_445.txt') as a:
    # open all 4 files with a meaningful name
    file=[open(outputfile.txt","w")
    for line in a:
Without regex:
for line in file:
    keep = []
    line = line.strip()
    # skip the FRAME header and any blank lines
    if not line or line.startswith('FRAME'):
        continue
    # first is the row number, second is the "J=1,8" token
    first, second, *_ = line.split()
    keep.append(first)
    first, second = second.split('=')
    keep.extend(second.split(','))
    print(' '.join(keep))
My advice? Since I don't write many regex's I avoid writing big ones all at once. Since you've already done that I would try to verify it a small chunk at a time, as illustrated in this code.
import re
r = re.compile(r"\s*(\d+)\s+J=(\S+)\s+SEC=(\S+)\s+NSEG=(\S+)+ANG=(\S+)\s")
r = re.compile(r"\s*(\d+)")
r = re.compile(r"\s*(\d+)\s+J=(\d+)")
with open('RawDataFile_445.txt') as a:
    a.readline()
    for line in a.readlines():
        result = r.match(line)
        if result:
            print (result.groups())
The first regex is your entire brute of an expression. The next line is the first chunk I verified. The next line is the second, bigger chunk that worked. Notice the slight change.
At this point I would go back, make the correction to the original, whole regex and then copy a bigger chunk to try. And re-run.
Let's focus on an example string we want to parse:
1 J=1,8
We have space(s), digit(s), more space(s), some characters, then digit(s), a comma, and more digit(s). If we replace them with regex characters, we get (\d+)\s+J=(\d+),(\d+), where + means we want 1 or more of that type. Note that we surround the digits with parentheses so we can capture them later with .groups() or .group(#), where # is the nth group.
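Putting that pattern to work on the sample from the question could look like the sketch below. It assumes Python 3 and uses an in-memory string in place of the real input file. Writing the result to a file is left out.

```python
import re

# Row number, then "J=<start>,<end>"; everything after that is ignored.
pattern = re.compile(r"\s*(\d+)\s+J=(\d+),(\d+)")

raw = """FRAME
1 J=1,8 SEC=CL1 NSEG=2 ANG=0
2 J=8,15 SEC=CL2 NSEG=2 ANG=0
3 J=15,22 SEC=CL3 NSEG=2 ANG=0
"""

rows = []
for line in raw.splitlines():
    match = pattern.match(line)
    # The FRAME header does not match the pattern and is skipped.
    if match:
        rows.append(" ".join(match.groups()))

print("\n".join(rows))
```

Because the pattern stops after the second number, the SEC, NSEG and ANG fields never need to be matched at all.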

Issue reading text file with pound sign

I was trying to read a tab-delimited text file like this:
1 2# 3
using:
test = genfromtxt('test2.txt', delimiter='\t', dtype = 'string', skip_header=0)
However, I get the output only of 1 and 2. The # acts like an ending character in the txt file. Is there any way to solve this if I want to read the pound sign as a string?
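genfromtxt treats # as the start of a comment by default (its comments parameter defaults to '#'), which is why everything after it is dropped. If you don't have to use genfromtxt, the_string.split('\t') should do the job.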
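The split-based approach can be sketched like this. It is a minimal sketch, assuming Python 3. The function name `read_tab_delimited` is made up for illustration, and the sample line mirrors the one in the question.

```python
def read_tab_delimited(lines):
    """Split each non-empty line on tabs, keeping '#' as ordinary text."""
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if line:
            rows.append(line.split("\t"))
    return rows


sample = "1\t2#\t3\n"
print(read_tab_delimited(sample.splitlines()))
```

Every field stays a string here, so any numeric conversion has to be done afterwards.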

Zeroes appearing when reading file (where there aren't any)

When reading a file (UTF-8 Unicode text, csv) with Python on Linux, either with:
csv.reader()
file()
the values of some columns get a zero as their first character (there are no such zeroes in the input), and others get a few zeroes. These zeroes are not visible when viewing the file with Geany or any other editor. For example:
Input
10016;9167DE1;Tom;Sawyer ;Street 22;2610;Wil;;378983561;tom#hotmail.com;1979-08-10 00:00:00.000;0;1;Wil;081208608;NULL;2;IZMH726;2010-08-30 15:02:55.777;2013-06-24 08:17:22.763;0;1;1;1;NULL
Output
10016;9167DE1;Tom;Sawyer ;Street 22;2610;Wil;;0378983561;tom#hotmail.com;1979-08-10 00:00:00.000;0;1;Wil;081208608;NULL;2;IZMH726;2010-08-30 15:02:55.777;2013-06-24 08:17:22.763;0;1;1;1;NULL
See 378983561 > 0378983561
Reading with:
f = file('/home/foo/data.csv', 'r')
data = f.read()
split_data = data.splitlines()
lines = list(line.split(';') for line in split_data)
print data[51220][8]
>>> '0378983561' #should have been '478983561' (reads like this in Geany etc.)
Same result with csv.reader().
Help me solve the mystery, what could be the cause of this? Could it be related to encoding/decoding?
The data you're getting is a string.
print data[51220][8]
>>> '0478983561'
If you want to use this as an integer, you should parse it.
print int(data[51220][8])
>>> 478983561
If you want this as a string, you should convert it back to a string.
print repr(str(int(data[51220][8])))
>>> '478983561'
csv.reader treats all columns as strings. Conversion to the appropriate type is up to you as in:
print int(data[51220][8])
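The point about `csv.reader` returning strings can be illustrated with a short sketch. It assumes Python 3 (the thread itself uses Python 2 syntax) and a shortened, made-up row in the question's semicolon-separated layout.

```python
import csv
import io

# One shortened row in the question's semicolon-separated layout.
raw = "10016;9167DE1;Tom;Sawyer ;Street 22;2610;Wil;;0378983561\n"

reader = csv.reader(io.StringIO(raw), delimiter=";")
row = next(reader)

phone_text = row[8]          # csv.reader always yields strings
phone_number = int(row[8])   # int() drops the leading zero

print(repr(phone_text), phone_number)
```

This only shows how to normalise the value after reading. It does not explain where a leading zero in the file came from.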
