How to extract all text between certain characters with Python re

I'm trying to extract all text between certain characters but my current code simply returns an empty list. Each row has a long text string that looks like this:
"[{'index': 0, 'spent_transaction_hash': '4b3e9741022d4', 'spent_output_index': 68, 'script_asm': '3045022100e9e2280f5e6d965ced44', 'value': Decimal('381094.000000000')}\n {'index': 1, 'spent_transaction_hash': '0cfbd8591a3423', 'spent_output_index': 2, 'script_asm': '3045022100a', 'value': Decimal('3790496.000000000')}]"
I just need the values for "spent_transaction_hash". For example, I'd like to create a new column that has a list of ['4b3e9741022d4', '0cfbd8591a3423']. I'm trying to extract the values between 'spent_transaction_hash': and the comma. Here's my current code:
import re

my_list = []
for row in df['column']:
    value = re.findall(r'''spent_transaction_hash'\: \(\[\'(.*?)\'\]''', row)
    my_list.append(value)
This code simply returns a blank list. Could anyone please tell me which part of my code is wrong?

Is this what you're looking for? 'spent_transaction_hash'\: '([a-z0-9]+)'
Test: https://regex101.com/r/cnviyS/1
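Applied to the question's DataFrame, a minimal sketch of that pattern in use (assuming df['column'] from the question):
import re

pattern = r"'spent_transaction_hash': '([a-z0-9]+)'"
my_list = [re.findall(pattern, row) for row in df['column']]
# each entry is a list like ['4b3e9741022d4', '0cfbd8591a3423']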

Since it looks like you already have a list of Python dicts, just in string form, why not eval it and grab the desired keys? Of course, with that approach you don't need the regex matching the question asked about. (Note that eval executes whatever it is given, so only use it on input you trust; ast.literal_eval won't work here because the Decimal(...) calls are not literals.)
from decimal import Decimal
v = """\
[{'index': 0, 'spent_transaction_hash': '4b3e9741022d4', 'spent_output_index': 68, 'script_asm': '3045022100e9e2280f5e6d965ced44', 'value': Decimal('381094.000000000')}\n {'index': 1, 'spent_transaction_hash': '0cfbd8591a3423', 'spent_output_index': 2, 'script_asm': '3045022100a', 'value': Decimal('3790496.000000000')}]\
"""
L = eval(v.replace('\n', ','))
hashes = [e['spent_transaction_hash'] for e in L]
print(hashes)
# ['4b3e9741022d4', '0cfbd8591a3423']
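To get the new column the question asked for, the same idea can be applied per row — a minimal sketch, assuming df['column'] holds one such string per row (and that you trust the data, since this still uses eval):
df['hashes'] = df['column'].apply(
    lambda s: [e['spent_transaction_hash'] for e in eval(s.replace('\n', ','))]
)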

Related

Is it possible to change a cell value via a dictionary in a Pandas DataFrame by iterating over the list in the cell

UPDATED
In a Pandas DataFrame, I have a column that contains a list like the below in its cells:
df_lost['Article']
Out[6]:
37774 186-2, 185-3, 185-2
37850 358-1, 358-4
37927
38266 111-2
38409 111-2
38508
38519 185-1
41161 185-4, 357-1
42948 185-1
Name: Article, dtype: object
For each entry like '186-2', '111-2', etc. I have a dictionary like
aDict = {'111-2': 'Text-1', '358-1': 'Text-2', .....}
Is it possible to iterate over the list in the DataFrame cells and replace each entry with its value from the dictionary?
Expected result:
37774 ['Text 1, Text 2, Text -5']
....
I have tried to use the map function
df['Article'] = df['Article'].map(aDict)
but it doesn't work with the list in a cell. As a temp solution, I have created the dictionary
aDict = {'186-2, 185-3, 185-2': 'Test - 1, test -2, test -3', .....}
This works, but the number of combinations is extremely big.
You need to split the string at the comma delimiters and then look up each element in the dictionary. You also have to index the list to get the string out of its first element, and wrap the result string back into a list.
def convert_string(string_list, mapping):
    # the cell holds a one-element list; split its string at the commas
    items = string_list[0].split(', ')
    # look up each element, keeping it unchanged when it's not in the mapping
    new_items = [mapping.get(i, i) for i in items]
    return [', '.join(new_items)]

df['Article'] = df['Article'].map(lambda cell: convert_string(cell, aDict))
I would use a regex and str.replace here:
import re

aDict = {'111-2': 'Text1', '358-1': 'Text 2'}
pattern = '|'.join(map(re.escape, aDict))
df['Article'] = df['Article'].str.replace(pattern, lambda m: aDict[m.group()], regex=True)
NB. If the dictionary keys can overlap (ab/abc), then they should be sorted by decreasing length to generate the pattern.
Output:
Article
37774 186-2, 185-3, 185-2
37850 Text 2, 358-4
37927
38266 Text1
38409 Text1
38508
38519 185-1
41161 185-4, 357-1
42948 185-1
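Regarding the note on overlapping keys, a minimal sketch of building the pattern from length-sorted keys:
# longer keys first, so e.g. 'abc' is tried before its prefix 'ab'
pattern = '|'.join(map(re.escape, sorted(aDict, key=len, reverse=True)))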

Convert the strings from index [1:] in the list into float — getting ValueError: could not convert string to float

I have a CSV file in which the data is in string format. I need to convert the numbers in each row to float. I cannot use any module to read the CSV file. The data in the CSV file is as follows: [['Argentina','23','24.5'......],['America','22.4','23.5'.....].......]
The code I wrote so far is:
with open('life.csv', 'r') as f:
    lines = [line.rstrip() for line in f]  # to remove \r\n
results = []
for line in lines:
    words = line.split(',')
    results.append(words)
print(results[1:])
Though I didn't understand your question completely, from the title I assume you want to know how, given a list of strings, to convert each string element of that list to a floating-point number.
If that's your question, here is an answer.
Assume you have a list ['12.23', '67.89', '90.12']
my_list = ['12.23', '67.89', '90.12']
for i in range(len(my_list)):
    my_list[i] = float(my_list[i])
Now your list will look like this: [12.23, 67.89, 90.12]
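For what it's worth, a list comprehension does the same conversion in one line:
my_list = [float(x) for x in my_list]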
Also please try to post your question with proper information. Otherwise, it's tough for us to understand it properly!
Assuming your list is all of the same format as the image you supplied (please don't post data as an image...), i.e. one 'proper' string followed by a series of floats stored as strings, you could convert them like this:
list1 = [['A', '1.2', '2.3'],
         ['B', '4.5', '6.7']]
list2 = [[x[0]] + list(map(float, x[1:])) for x in list1]
This converts the floats and outputs:
[['A', 1.2, 2.3], ['B', 4.5, 6.7]]
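If some entries may not be numeric (which would raise the ValueError from the title), a hedged variant keeps any string float() cannot parse; the helper name is just for illustration:
def to_float_or_keep(s):
    # return float(s) when possible, otherwise keep the original string
    try:
        return float(s)
    except ValueError:
        return s

list2 = [[x[0]] + [to_float_or_keep(v) for v in x[1:]] for x in list1]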

How to find if a string is in a list of lists in Python?

I am trying to process a CSV file and want to extract the entire row if it contains a string, adding it to another, brand-new list. But my approach gives me all the rows that contain that string as a substring, whereas I want only the rows with the exact string. Let me explain with an example:
I have the following list of lists:
myList = [['abc', 1, 3, 5, 6], ['abcx', 5, 6, 8, 9], ['abcn', 7, 12, 89, 23]]
I want to get the whole list which has the string 'abc'. I tried the following:
newList = []
for temp in myList:
    if 'abc' in temp:
        newList.append(temp)
But this gives me all the rows, as 'abc' is a substring of the other strings too. What is a cleaner approach to solve this problem?
Update:
I have a huge CSV file, which I am reading line by line using readlines(), and I want to find the lines which have the "abc" gene and shove each whole line into a list. But when I do if 'abc' in line, I get all the other lines which also have "abc" as a substring. How can I ignore the substrings?
From your comment on the question, I think it is straightforward to use numpy and pandas if you want to process a CSV file. Pandas has an in-built CSV reader, and you can extract the row and convert it into a list or a numpy array in a couple of lines with ease. Here's how I would do it:
import pandas
df = pandas.read_csv("your_csv")
#assuming you have column names.
x = df.loc[df['col_name'] == 'abc'].values.tolist() #this will give you the whole row and convert into a list.
Or
import numpy as np
x = np.array(df.loc[df['col_name'] == 'abc']) #gives you a numpy array
This gives you much more flexibility to do processing. I hope this helps.
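If you'd rather stay with plain file reading, a sketch that compares whole comma-separated fields instead of substrings (the file name and delimiter here are assumptions):
matches = []
with open('data.csv') as f:
    for line in f:
        fields = line.rstrip('\n').split(',')
        if 'abc' in fields:  # exact field match, not a substring test
            matches.append(fields)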
It seems you want to append only if the string matches 'abc' exactly (e.g. true for 'abc', but false for 'abcx'). Is this correct?
If so, you need to make two corrections.
First, you need to index the list: currently temp is the entire inner list, but if you know the string will always be in position 0, index that in the if statement (if you don't know the position, a nested loop will work).
Second, you need to use '==' instead of 'in': on strings, 'in' matches a part of a larger string, whereas '==' requires an exact match.
newList = []
for temp in myList:
    if temp[0] == 'abc':
        newList.append(temp)
or
newList = [temp for temp in myList if temp[0] == 'abc']
Your code works, as others have said before me.
Part of your question asked for cleaner code. Since you only want the sub-lists that contain your string, I would recommend filter:
check_against_string = 'abc'
newList = list(filter(lambda sub_list: check_against_string in sub_list, myList))
filter keeps the elements for which the function returns true. It is exactly the code you wrote, but more pythonic!

String replace with multiple items

I have two pandas DataFrames. One contains text, the other a set of terms I'd like to search for and replace within the text. I have created a loop which replaces each term in the text; however, it's very slow, especially given that it is working over a large corpus.
My question is:
Is there a more efficient solution that replicates my method below?
Example text dataframe:
import pandas as pd

d = {'ID': [1, 2, 3], 'Text': ['here is some random text', 'random text here', 'more random text']}
text_df = pd.DataFrame(data=d)
Example terms dataframe:
d = {'Replace_item': ['<RANDOM_REPLACED>', '<HERE_REPLACED>', '<SOME_REPLACED>'], 'Text': ['random', 'here', 'some']}
replace_terms_df = pd.DataFrame(data=d)
Example of current solution:
def find_replace(text, terms):
    for _, row in terms.iterrows():
        term = row['Text']
        item = row['Replace_item']
        text.Text = text.Text.str.replace(term, item)
    return text

find_replace(text_df, replace_terms_df)
Please let me know if anything above requires clarifying. Thank you,
Using zip + str.replace on the three columns, and assigning the results to the column at once, reduced the time by 50% (~400us to ~200us using %timeit):
text_df['Text'] = [z.replace(x, y) for (x, y, z) in zip(replace_terms_df.Text, replace_terms_df.Replace_item, text_df.Text)]
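If instead you want every term applied to every row, as the original loop does, a sketch using a single combined regex (assuming the terms don't overlap):
import re

mapping = dict(zip(replace_terms_df.Text, replace_terms_df.Replace_item))
pattern = '|'.join(map(re.escape, mapping))
text_df['Text'] = text_df['Text'].str.replace(pattern, lambda m: mapping[m.group()], regex=True)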

Python - Format file to list

I've read some questions on how to read a file into a list, but this file is a bit trickier, as there are different data types in it, all of them represented as strings in that file.
I've managed to get a file that looks like this:
['verb', 0, 5, 7]['noun', 9, 3, 4]
How can I turn this into a list that looks like:
list = [['verb', 0, 5, 7], ['noun', 9, 3, 4]]
where 'verb' and 'noun' are strings and all numbers are integers.
You can try this:
import re

data = open('file.txt').read().strip('\n')
lists = re.findall(r"\[(.*?)\]", data)
final_list = [[int(i) if i.isdigit() else i[1:-1] for i in b.split(", ")] for b in lists]
If you mean that the file consists of groups enclosed in square brackets, all in one line, then probably the best idea is to replace all ][ sequences by ],[ sequences so that your data becomes valid JSON, then parse it using json.loads:
import json

with open('myfile', 'r') as f:
    line = f.readline().rstrip()
list_of_lists = json.loads("[" + line.replace('][', '],[').replace("'", '"') + "]")
You could use regex for this task. Below you can find a sketch of how to apply this technique here:
import re
s = "['verb', 0, 5, 7]['noun', 9, 3, 4]"
# create the regex expression
pattern = re.compile(r'\[(.*?)\]')
# store the data here
result = []
# get every item entry using the regex expression
for x in re.findall(pattern, s):
    z = x.split(",")
    # parse the data entries
    result.append([z[0].replace("'", ""), int(z[1]), int(z[2]), int(z[3])])
result
>>>[['verb', 0, 5, 7], ['noun', 9, 3, 4]]
I would open the file, read it into a string, and replace all ][ occurrences with ],[. Then I'd eval (yeah, I know it's evil, but...) that string and convert it to a list.
with open('your_file.txt', 'r') as raw_file:
    your_str = raw_file.read()
your_str = your_str.replace('][', '],[')
your_list = list(eval(your_str))
If you know that the content of the file will be JSON, you can use json.loads instead of eval. For the example you gave above, you would have to convert ' to " for it to be valid JSON.
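A safer variant of the same idea uses ast.literal_eval, which only accepts Python literals instead of executing arbitrary code — a minimal sketch:
import ast

with open('your_file.txt', 'r') as raw_file:
    your_str = raw_file.read().replace('][', '],[')
your_list = list(ast.literal_eval(your_str))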
Follow these steps:
Identify who is storing data in this non-format.
Call them.
Talk with that person about the benefits of storing data in a standard format. Some available formats: CSV, JSON, XML.
Ask them to send the data in one of those formats.
Deserialize the data from the file with a one-line library utility, like json.load(open('your_file.txt')).
