I have a .txt (notepad) file called Log1. It has the following saved in it: [1, 1, 1, 0]
When I write a program to retrieve the data:
Log1 = pd.read_csv('Path...\\Log1.txt')
Log1 = list(Log1)
print(Log1)
It prints: ['[1', ' 1', ' 1.1', ' 0]']
I don't understand where the ".1" on the third number is coming from. It's not in the text file; it just gets added.
Funnily enough, if I change the numbers in the text file to [1, 0, 1, 1], it does not add the .1. It prints ['[1', ' 0', ' 1', ' 1]']
Very odd; if anyone has an idea why it's acting this way, I'd appreciate it.
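(The `.1` comes from pandas, not the file: `read_csv` treats the first line as a header row, and when two column names collide, as the two ` 1` entries do here, pandas de-duplicates them by appending `.1`. A minimal reproduction, using `io.StringIO` in place of the file:)

```python
import io
import pandas as pd

# read_csv treats the single line as a header row; the duplicate " 1"
# column name gets ".1" appended by pandas' duplicate-name handling.
df = pd.read_csv(io.StringIO("[1, 1, 1, 0]"))
print(list(df))  # ['[1', ' 1', ' 1.1', ' 0]']

# With header=None the line is kept as data instead of column names.
df2 = pd.read_csv(io.StringIO("[1, 1, 1, 0]"), header=None)
print(df2.values.tolist())
```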
Well, I worked out some other options as well, just for the record:
Solution 1 (plain read - this one gets a list of string)
log4 = []
with open('log4.txt') as f:
    log4 = f.readlines()
print(log4)
Solution 2 (convert to list of ints)
import ast
with open('log4.txt', 'r') as f:
    inp = ast.literal_eval(f.read())
print(inp)
Solution 3 (old school string parsing - convert to list of ints, then put it in a dataframe)
import pandas as pd

with open('log4.txt', 'r') as f:
    mylist = f.read()
mylist = mylist.replace('[', '').replace(']', '').replace(' ', '')
mylist = mylist.split(',')
df = pd.DataFrame({'Col1': mylist})
df['Col1'] = df['Col1'].astype(int)
print(df)
Other ideas here as well:
https://docs.python-guide.org/scenarios/serialization/
In general, reading from the text file (deserializing) is easier if the file was written in a well-structured format in the first place: a csv file, pickle file, json file, etc. In this case, ast.literal_eval() worked well since the data was written out as a list using its __repr__ format. Honestly, I'd never done that before, so it was an interesting solution to me as well :)
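For example, if the writing side serialized the list as JSON in the first place, the read side becomes a one-liner (the file name here is hypothetical):

```python
import json

data = [1, 1, 1, 0]

# Writing side: serialize the list as JSON text.
with open('log1.json', 'w') as f:
    json.dump(data, f)

# Reading side: deserialize straight back to a list of ints.
with open('log1.json') as f:
    restored = json.load(f)

print(restored)  # [1, 1, 1, 0]
```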
This should work. Can you please try this:
log2 = log1.values.tolist()
Output:
[['1'], ['1'], ['1'], ['0']]
Your data is not in a CSV format. In CSV you would rather have
1;1;0;1
or something similar.
If you have multiple lines like this, it might make sense to parse this as CSV, otherwise I'd rather parse it using a regexp and .split on the result.
Proposal: Add a bigger input example and your expected output.
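For a single line like the one shown, the regexp route could be as simple as this (one sketch among many, assuming the line only contains the numbers you want):

```python
import re

line = '[1, 1, 1, 0]'
# Pull out every (optionally signed) integer and convert it.
numbers = [int(n) for n in re.findall(r'-?\d+', line)]
print(numbers)  # [1, 1, 1, 0]
```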
I am trying to process a csv file and want to extract the entire row if it contains a string, adding it to another, brand-new list. But my approach gives me all the rows that contain that string, whereas I want only the row with the exact string. Let me explain it with an example:
I have the following list of lists:
myList = [['abc', 1, 3, 5, 6], ['abcx', 5, 6, 8, 9], ['abcn', 7, 12, 89, 23]]
I want to get the whole list which has the string 'abc'. I tried the following:
newList = []
for temp in myList:
    if 'abc' in temp:
        newList.append(temp)
But this gives me all the values, as 'abc' is a substring of the other strings too. What is a cleaner approach to solve this problem?
Update:
I have a huge CSV file, which I am reading line by line using readlines(), and I want to find the line which has the "abc" gene and shove the whole line into a list. But when I do the `'abc' in` check, I get all the other strings which also have "abc" as a substring. How can I ignore the substrings?
From your comment on the question, I think it is straightforward to use numpy and pandas if you want to process a csv file. Pandas has a built-in csv reader, and you can extract the row and convert it into a list or a numpy array in a couple of lines with ease. Here's how I would do it:
import pandas
df = pandas.read_csv("your_csv")
#assuming you have column names.
x = df.loc[df['col_name'] == 'abc'].values.tolist() #this will give you the whole row and convert into a list.
Or
import numpy as np
x = np.array(df.loc[df['col_name'] == 'abc']) #gives you a numpy array
This gives you much more flexibility to do processing. I hope this helps.
It seems you want to append only if the string matches 'abc' and nothing else (e.g. true for 'abc', but false for 'abcx'). Is this correct?
If so, you need to make two corrections.
First, you need to index the list. Currently temp is the entire sub-list; if you know the string will always be in position 0, index that in the if statement (if you don't, a nested for loop will work).
Second, you need to use '==' instead of 'in' when comparing strings: 'in' matches a substring of a larger string, whereas '==' requires an exact match.
newList = []
for temp in myList:
    if temp[0] == 'abc':
        newList.append(temp)
or
newList = [temp for temp in myList if temp[0] == 'abc']
Your code works, as others have said before me.
Part of your question was to get a cleaner code. Since you only want the sub-lists that contain your string, I would recommend to use filter:
check_against_string = 'abc'
newList = list(filter(lambda sub_list: check_against_string in sub_list, myList))
filter creates a list of elements for which a function returns true. It is exactly the code you wrote, but more pythonic!
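To see why the original loop already works as posted: `in` applied to a list tests whole-element equality, not substring containment, so 'abcx' never matches. A quick demonstration:

```python
myList = [['abc', 1, 3, 5, 6], ['abcx', 5, 6, 8, 9], ['abcn', 7, 12, 89, 23]]

# Membership on a list compares whole elements...
print('abc' in myList[1])   # False: 'abc' != 'abcx'
# ...while membership on a string is a substring test.
print('abc' in 'abcx')      # True

newList = [temp for temp in myList if 'abc' in temp]
print(newList)              # [['abc', 1, 3, 5, 6]]
```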
I'm trying to extract some info from a file. The file has many lines like the one below
"names":["DNSCR"],"actual_names":["RADIO_R"],"castime":[2,4,6,8,10] ......
I want to search each line for names and castime; if found, I want to print the values in the brackets.
The values in the brackets change from line to line. For example, in the line above, names is DNSCR and castime is 2,4,6,8,10, but the length might be different in the next line.
I have tried the following code, but it always gives me 10 characters, and I only need whatever is in the brackets.
c_req = 10
keywords = ['"names":', '"castime":']
with open('mylogfile.log') as searchfile:
    for line in searchfile:
        for key in keywords:
            left, sep, right = line.partition(key)
            if sep:
                print(key + " = " + right[:c_req])
This looks just like JSON. Are there braces around each line?
If so, the whole content is trivial to parse:
import json
test = '{"names":["DNSCR"],"actual_names":["RADIO_R"],"castime":[2,4,6,8,10]}'
result = json.loads(test)
print(result["names"], result["castime"])
You could also use a library like pandas to read the whole file into a dataframe if it matches a whole JSON file.
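If the lines in the file are not already wrapped in braces, one speculative way to turn each fragment into valid JSON before parsing is to add the braces yourself:

```python
import json

# Stand-in for the lines read from the log file.
lines = ['"names":["DNSCR"],"actual_names":["RADIO_R"],"castime":[2,4,6,8,10]']

for line in lines:
    # Wrap the key/value fragment in braces so json.loads accepts it.
    record = json.loads('{' + line.strip() + '}')
    print(record['names'], record['castime'])
```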
Use Regular Expression:
import re
# should contain all lines
lines = ['"names":["DNSCR"],"actual_names":["RADIO_R"],"castime":[2,4,6,8,10]']
# more efficient in large files
# raw strings avoid invalid-escape warnings in the patterns
names_pattern = re.compile(r'"names":\["(\w+)"\]')
castime_pattern = re.compile(r'"castime":\[(.+)\],?')
names, castimes = list(), list()
for line in lines:
    names.append(re.search(names_pattern, line).group(1))
    castimes.append(
        [int(num) for num in re.search(castime_pattern, line).group(1).split(',')]
    )
Add the exception handling and the file opening/reading yourself.
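With the file reading and error handling folded in, the sketch above might become something like this (the file name is the asker's; note that `re.search` returns `None` on lines that lack a key, which the `if` guards against):

```python
import re

NAMES_RE = re.compile(r'"names":\["(\w+)"\]')
CASTIME_RE = re.compile(r'"castime":\[(.+?)\]')

def parse_lines(lines):
    """Collect the names and castime lists from an iterable of log lines."""
    names, castimes = [], []
    for line in lines:
        name_m = NAMES_RE.search(line)
        cast_m = CASTIME_RE.search(line)
        if name_m and cast_m:  # skip lines that lack either key
            names.append(name_m.group(1))
            castimes.append([int(n) for n in cast_m.group(1).split(',')])
    return names, castimes

# Usage with the file and basic error handling:
try:
    with open('mylogfile.log') as f:
        names, castimes = parse_lines(f)
except OSError as e:
    print('could not read log file:', e)
```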
Given mylogfile.log:
"names":["DNSCR"],"actual_names":["RADIO_R"],"castime":[2,4,6,8,10]
"names":["FOO", "BAR"],"actual_names":["RADIO_R"],"castime":[1, 2, 3]
Using regular expressions and ast.literal_eval:
import ast
import re
keywords = ['"names":', '"castime":']
keywords_name = ['names', 'castime']
d = {}
with open('mylogfile.log') as searchfile:
    for i, line in enumerate(searchfile):
        d['line ' + str(i)] = {}
        for key, key_name in zip(keywords, keywords_name):
            d['line ' + str(i)][key_name] = ast.literal_eval(re.search(key + r'\[(.*?)\]', line).group(1))
print(d)
#{ 'line 0': {'castime': (2, 4, 6, 8, 10), 'names': 'DNSCR'},
# 'line 1': {'castime': (1, 2, 3), 'names': ('FOO', 'BAR')}}
re.search(key + r'\[(.*?)\]', line).group(1) will catch everything in between the [] after your keys.
And ast.literal_eval() will remove the useless quotes and spaces in your string and automatically create tuples when needed.
I also used enumerate to keep track of which line each entry came from in the log file.
I am new to python and want to split what I have read in from a text file into two specific parts. Below is an example of what could be read in:
f = ['Cats','like','dogs','as','much','cats.'][1,2,3,4,5,4,3,2,6]
What I want to achieve, so that the second part of the program can run, is:
words = ['Cats','like','dogs','as','much','cats.']
numbers = [1,2,3,4,5,4,3,2,6]
I have tried using:
words,numbers = f.split("][")
However, this removes the two brackets at the split point from the new variables, which means the second part of my program, which recreates the original text, does not work.
Thanks.
I assume f is a string like
f = "['Cats','like','dogs','as','much','cats.'][1,2,3,4,5,4,3,2,6]"
Then we can find the index of '][' and add one to find the point between the brackets:
i = f.index('][')
a, b = f[:i+1], f[i+1:]
print(a)
print(b)
output:
['Cats','like','dogs','as','much','cats.']
[1,2,3,4,5,4,3,2,6]
Another alternative, if you still want to use split():
f = "['Cats','like','dogs','as','much','cats.'][1,2,3,4,5,4,3,2,6]"
d="]["
print(f.split(d)[0] + d[0])
print(d[1] + f.split(d)[1])
If you can make your file look something like this:
[["Cats","like","dogs","as","much","cats."],[1,2,3,4,5,4,3,2,6]]
then you could simply use Python's json module to do this for you. Note that the JSON format requires double quotes rather than single.
import json
f = '[["Cats","like","dogs","as","much","cats."],[1,2,3,4,5,4,3,2,6]]'
a, b = json.loads(f)
print(a)
print(b)
Documentation for the json library can be found here: https://docs.python.org/3/library/json.html
An alternative to Patrick's answer using regular expressions:
import re
data = "f = ['Cats','like','dogs','as','much','cats.'][1,2,3,4,5,4,3,2,6]"
pattern = r'f = (?P<words>\[.*?\])(?P<numbers>\[.*?\])'
match = re.match(pattern, data)
words = match.group('words')
numbers = match.group('numbers')
print(words)
print(numbers)
Output
['Cats','like','dogs','as','much','cats.']
[1,2,3,4,5,4,3,2,6]
If I understand correctly, you have a text file that contains ['Cats','like','dogs','as','much','cats.'][1,2,3,4,5,4,3,2,6] and you just need to split that string at the transition between brackets. You can do this with the string.index() method and string slicing. See my console output below:
>>> f = open('./catsdogs12.txt', 'r')
>>> input = f.read()[:-1] # Read file without trailing newline (\n)
>>> input
"['Cats','like','dogs','as','much','cats.'][1,2,3,4,5,4,3,2,6]"
>>> bracket_index = input.index('][') # Get index of transition between brackets
>>> bracket_index
41
>>> words = input[:bracket_index + 1] # Slice from beginning of string
>>> words
"['Cats','like','dogs','as','much','cats.']"
>>> numbers = input[bracket_index + 1:] # Slice from middle of string
>>> numbers
'[1,2,3,4,5,4,3,2,6]'
Note that this will leave you with Python strings that look visually identical to lists (arrays). If you need the data represented as native Python objects (i.e. so that you can actually use it like a list), you'll need some combination of string[1:-1].split(',') on both strings, plus the builtin map() (or a list comprehension) on the numbers list to convert the numbers from strings to ints.
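That conversion could look like this, sticking with the split approach just described:

```python
words_str = "['Cats','like','dogs','as','much','cats.']"
numbers_str = '[1,2,3,4,5,4,3,2,6]'

# Drop the surrounding brackets, split on commas, then strip the quotes.
words = [w.strip("'") for w in words_str[1:-1].split(',')]
# Same idea for the numbers, converting each piece to int.
numbers = [int(n) for n in numbers_str[1:-1].split(',')]

print(words)    # ['Cats', 'like', 'dogs', 'as', 'much', 'cats.']
print(numbers)  # [1, 2, 3, 4, 5, 4, 3, 2, 6]
```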
Hope this helps!
Another thing you can do is first replace '][' with ']-[' and then split or partition on '-'. I'd suggest split, since we don't want the delimiter in the result.
SPLIT
f = "['Cats','like','dogs','as','much','cats.'][1,2,3,4,5,4,3,2,6]"
f = f.replace('][',']-[')
a,b = f.split('-')
Output
>>> print(a)
['Cats','like','dogs','as','much','cats.']
>>> print(b)
[1,2,3,4,5,4,3,2,6]
PARTITION
f = "['Cats','like','dogs','as','much','cats.'][1,2,3,4,5,4,3,2,6]"
f = f.replace('][',']-[')
a,b,c = f.partition('-')
Output
>>> print(a)
['Cats','like','dogs','as','much','cats.']
>>> print(c)
[1,2,3,4,5,4,3,2,6]
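For the record, an equivalent without the sentinel character: partition on '][' itself and re-attach a bracket on each side, which avoids trouble if the data ever contains a '-'.

```python
f = "['Cats','like','dogs','as','much','cats.'][1,2,3,4,5,4,3,2,6]"

# Partition on the '][' boundary itself, then restore a bracket per side.
a, sep, c = f.partition('][')
a, c = a + ']', '[' + c

print(a)  # ['Cats','like','dogs','as','much','cats.']
print(c)  # [1,2,3,4,5,4,3,2,6]
```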