Given the following Data Frame:
df = pd.DataFrame({'term' : ['analys','applic','architectur','assess','item','methodolog','research','rs','studi','suggest','test','tool','viewer','work'],
'newValue' : [0.810419, 0.631963 ,0.687348, 0.810554, 0.725366, 0.742715, 0.799152, 0.599030, 0.652112, 0.683228, 0.711307, 0.625563, 0.604190, 0.724763]})
df = df.set_index('term')
print(df)
             newValue
term
analys       0.810419
applic       0.631963
architectur  0.687348
assess       0.810554
item         0.725366
methodolog   0.742715
research     0.799152
rs           0.599030
studi        0.652112
suggest      0.683228
test         0.711307
tool         0.625563
viewer       0.604190
work         0.724763
I am trying to update the values after each "^" in the following string with the values from the DataFrame.
(analysi analys^0.8046919107437134 studi^0.6034331321716309 framework methodolog^0.7360332608222961 architectur^0.6806665658950806)^0.0625 (recommend suggest^0.6603200435638428 rs^0.5923488140106201)^0.125 (system tool^0.6207902431488037 applic^0.610009491443634)^0.25 (evalu assess^0.7828741073608398 test^0.6444937586784363)^0.5
Each value should be replaced according to its corresponding word, so that I get this:
(analysi analys^0.810419 studi^0.652112 framework methodolog^0.742715 architectur^0.687348)^0.0625 (recommend suggest^0.683228 rs^0.599030)^0.125 (system tool^0.625563 applic^0.631963)^0.25 (evalu assess^0.810554 test^0.711307)^0.5
Thanks in advance for helping!
The best way I could come up with does this in multiple stages.
First, take the old string and extract all the values that you want to replace. That can be done with a regular expression:
import re

old_string = "(analysi analys^0.8046919107437134 studi^0.6034331321716309 framework methodolog^0.7360332608222961 architectur^0.6806665658950806)^0.0625 (recommend suggest^0.6603200435638428 rs^0.5923488140106201)^0.125 (system tool^0.6207902431488037 applic^0.610009491443634)^0.25 (evalu assess^0.7828741073608398 test^0.6444937586784363)^0.5"
pattern = re.compile(r"(\w+\^(0|[1-9]\d*)(\.\d+)?)")
# pattern.findall(old_string) returns a list of tuples,
# so we need to keep just the outer capturing group for each match.
matches = [m[0] for m in pattern.findall(old_string)]
print("Matches:", matches)
In the next part, we build two dictionaries. The first maps the prefix of each value to replace (the word part, before ^) to the whole token. We use it to create the second dictionary, mapping each token to replace to its new value (from the dataframe).
prefix_dict = {}
for m in matches:
    pre, post = m.split('^')
    prefix_dict[pre] = m
print("Prefixes:", prefix_dict)
matches_dict = {}
for i, row in df.iterrows():  # df is the dataframe from the question
    if i in prefix_dict:
        old_val = prefix_dict[i]
        new_val = "%s^%s" % (i, row.newValue)
        matches_dict[old_val] = new_val
print("Matches dict:", matches_dict)
With that done, we can loop through the items in the old value > new value dictionary and replace all the old values in the input string.
new_string = old_string
for key, val in matches_dict.items():
    new_string = new_string.replace(key, val)
print("New string:", new_string)
I have a list of file names as strings, and I want to store, in a list, the file name with the minimum ending number among the file names that share the same beginning denotation.
Example: for any file names in the list beginning with '2022-04-27_Cc1cPL3punY', I'd only want to store the file name with the minimum value of the number at the end. In this case, it would be the file name ending in 2825288523641594007, and so on for the other beginning denotations.
files = ['2022-04-27_Cc1a6yWpUeQ_2825282726106954381.jpg',
'2022-04-27_Cc1a6yWpUeQ_2825282726106985382.jpg',
'2022-04-27_Cc1cPL3punY_2825288523641594007.jpg',
'2022-04-27_Cc1cPL3punY_2825288523641621697.jpg',
'2022-04-27_Cc1cPL3punY_2825288523650051140.jpg',
'2022-04-27_Cc1cPL3punY_2825288523650168421.jpg',
'2022-04-27_Cc1cPL3punY_2825288523708854776.jpg',
'2022-04-27_Cc1cPL3punY_2825288523717189707.jpg',
'2022-04-27_Cc1dN3Rp0es_2825292832374568690.jpg',
'2022-04-27_Cc1dN3Rp0es_2825292832383025904.jpg',
'2022-04-27_Cc1dN3Rp0es_2825292832383101420.jpg',
'2022-04-27_Cc1dN3Rp0es_2825292832383164193.jpg',
'2022-04-27_Cc1dN3Rp0es_2825292832399945744.jpg',
'2022-04-27_Cc1dN3Rp0es_2825292832458472617.jpg']
Given that your files are already sorted in ascending order by your OS/file manager, you can just find the first one for each common prefix:
# files is the list from the question
prefix_old = None
prefix = None
for f in files:
    parts = f.split('_', 2)
    prefix = '_'.join(parts[:2])
    if prefix != prefix_old:
        value = parts[2].split('.')[0]
        print(f'Min value with prefix {prefix} is {value}')
    prefix_old = prefix
Output
Min value with prefix 2022-04-27_Cc1a6yWpUeQ is 2825282726106954381
Min value with prefix 2022-04-27_Cc1cPL3punY is 2825288523641594007
Min value with prefix 2022-04-27_Cc1dN3Rp0es is 2825292832374568690
It seems that the list of files you have is already sorted according to groups of prefixes, and then according to the numbers. If that's indeed the case, you just need to take the first path of each prefix group. This can be done easily with itertools.groupby:
from itertools import groupby

for key, group in groupby(files, lambda file: file.rsplit('_', 1)[0]):
    print(key, "min:", next(group))
If you can't rely on them being internally ordered, find the minimum of each group according to the number:
for key, group in groupby(files, lambda file: file.rsplit('_', 1)[0]):
    print(key, "min:", min(group, key=lambda file: int(file.rsplit('_', 1)[1].removesuffix(".jpg"))))
And if you can't even rely on it being ordered by groups, just sort the list beforehand:
files.sort(key=lambda file: file.rsplit('_', 1)[0])
for key, group in groupby(files, lambda file: file.rsplit('_', 1)[0]):
    print(key, "min:", min(group, key=lambda file: int(file.rsplit('_', 1)[1].removesuffix(".jpg"))))
If the same pattern is being followed, you can try to split each name by the separators (in your example '.' and '_'; documentation on how split works here), and then sort that list by sorting a list of lists, as explained here. This needs to be done per group identifier (which I'll call an ID), so we first get the unique IDs and then iterate over them. After that, we can proceed with the splitting. By doing this, you'll get, for each ID, a list of lists with the complete file name in position 0 and the number from the suffix in position 1:
names = files  # the list from the question
prefix = list(set([pre.split('_')[1] for pre in names]))
names_split = []
for pre in prefix:
    names_split.append([pre, [[name, name.split('.')[0].split('_')[2]] for name in names if name.split('_')[1] == pre]])
for i in range(len(prefix)):
    names_split[i][1] = sorted(names_split[i][1], key=lambda x: int(x[1]))
print(names_split)
The file you need should be names_split[x][1][0][0], where x identifies each ID.
PS: If you need to find a particular ID, you can use
searched_index = [value[0] for value in names_split].index(ID)
and then names_split[searched_index][1][0][0].
Edit: Changed the order of the split characters and added docs on the split method
Edit 2: Added prefix grouping
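For comparison, the same minimum-per-ID result can be had in a single pass with a plain dict; a sketch, assuming files is the list from the question:
min_per_prefix = {}
for name in files:
    prefix, _, rest = name.rpartition('_')
    number = int(rest.removesuffix('.jpg'))
    # keep only the smallest number seen so far for this prefix
    if prefix not in min_per_prefix or number < min_per_prefix[prefix][1]:
        min_per_prefix[prefix] = (name, number)

smallest = [name for name, _ in min_per_prefix.values()]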
Your best bet is probably to use the pandas library; it is very good at dealing with tabular data.
import pandas as pd
file_name_list = [] # Fill in here
file_name_list = [file_name[:-4] for file_name in file_name_list] # Get rid of .jpg
file_name_series = pd.Series(file_name_list) # Put the data in pandas
file_name_table = file_name_series.str.split("_", expand=True) # Split the strings
file_name_table.columns = ['date', 'prefix', 'number'] # Renaming for readability
file_name_table['number'] = file_name_table['number'].astype(int)
smallest_file_names = file_name_table.groupby(by=['date', 'prefix'])['number'].min()
smallest_file_names_list = [f"{date}_{prefix}_{number}.jpg"  # putting the name (and .jpg) back together
                            for (date, prefix), number in smallest_file_names.items()]
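With the sample list above, smallest_file_names_list should come out as the three minimal names:
['2022-04-27_Cc1a6yWpUeQ_2825282726106954381.jpg',
 '2022-04-27_Cc1cPL3punY_2825288523641594007.jpg',
 '2022-04-27_Cc1dN3Rp0es_2825292832374568690.jpg']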
I have a dictionary represented as a string in Python, and I am looking to edit some key names based on a particular value. Here's an example of the dictionary in string format:
s = '{"some.info": "ABC","more.info": "DEF","device.0.Id":"12345678", "device.0.Type":"DEVICE-X", ' \
'"device.0.Status":"ACTIVE", "device.1.Id":"123EFEF8", "device.1.Type":"DEVICE-Y", "device.1.Status":"NOT FOUND", ' \
'"device.2.Id":"ABCD4328", "device.2.Type":"DEVICE-Z", "device.2.Status":"SLEEPING", "other.info":"Hello", ' \
'"additional.info":"Hi Again",}'
I have a working method below, which converts the string into a dictionary, scans for key entries containing '.Type', and drops into a list a tuple of the key section to replace and what to replace it with. However, the whole process seems too inefficient; is there a better way to do this?
I have key value pairs of interest in my dictionary like this:
'device.0.Type':'DEVICE-X'
'device.1.Type':'DEVICE-Y'
'device.2.Type':'DEVICE-Z'
What I am looking to do is change all Key name instances of device.X to the value given for key 'device.X.Type'.
For example:
'device.0.Id':'12345678', 'device.0.Type':'DEVICE-X', 'device.0.Status':'ACTIVE',
'device.1.Id':'123EFEF8', 'device.1.Type':'DEVICE-Y', 'device.1.Status':'NOT FOUND', etc
would become:
'DEVICE-X.Id':'12345678', 'DEVICE-X.Type':'DEVICE-X', 'DEVICE-X.Status':'ACTIVE',
'DEVICE-Y.Id':'123EFEF8', 'DEVICE-Y.Type':'DEVICE-Y', 'DEVICE-Y.Status':'NOT FOUND', etc
Basically I am looking to remove the ambiguity of 'device.X' with something that's easier to read, based on the device type.
Here's my long-winded version:
s = '{"some.info": "ABC","more.info": "DEF","device.0.Id":"12345678", "device.0.Type":"DEVICE-X", ' \
'"device.0.Status":"ACTIVE", "device.1.Id":"123EFEF8", "device.1.Type":"DEVICE-Y", "device.1.Status":"NOT FOUND", ' \
'"device.2.Id":"ABCD4328", "device.2.Type":"DEVICE-Z", "device.2.Status":"SLEEPING", "other.info":"Hello", ' \
'"additional.info":"Hi Again",}'
d = eval(s)
devs = []
for k, v in d.items():
    if '.Type' in k:
        devs.append((k.split('.Type')[0], v))
for item in devs:
    if item[0] in s:
        s = s.replace(item[0], item[1])
s = eval(s)
print(s)
You can solve this by loading the data as a json, then iterating over it:
import json
s = '{"some.info": "ABC","more.info": "DEF","device.0.Id":"12345678", "device.0.Type":"DEVICE-X", "device.0.Status":"ACTIVE", "device.1.Id":"123EFEF8", "device.1.Type":"DEVICE-Y", "device.1.Status":"NOT FOUND", "device.2.Id":"ABCD4328", "device.2.Type":"DEVICE-Z", "device.2.Status":"SLEEPING", "other.info":"Hello", "additional.info":"Hi Again"}'
# load the string to a dictionary
devices_data = json.loads(s)
device_names = {}
for key, value in devices_data.items():
    if key.endswith("Type"):
        # if the key looks like a device type, store the value
        device_names[key.rpartition(".")[0]] = value

renamed_device_data = {}
for key, value in devices_data.items():
    x = key.rpartition(".")  # split the key apart
    if x[0] in device_names:  # check if the first part matches a device name
        renamed_device_data[f"{device_names[x[0]]}.{x[2]}"] = value  # add the new key to the renamed dictionary with the value
    else:
        renamed_device_data[key] = value  # for non-matches, put them in as is
This could certainly be optimised, but it should work at least!
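For what it's worth, the same logic also fits into a comprehension plus a small helper; a sketch, assuming devices_data is the dictionary loaded above:
device_names = {key.rpartition(".")[0]: value
                for key, value in devices_data.items()
                if key.endswith(".Type")}

def rename(key):
    # swap the 'device.X' prefix for the device type, if we know it
    prefix, _, suffix = key.rpartition(".")
    return f"{device_names[prefix]}.{suffix}" if prefix in device_names else key

renamed_device_data = {rename(key): value for key, value in devices_data.items()}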
I am trying to look up dictionary indices for thousands of strings and this process is very, very slow. There are package alternatives, like KeyedVectors from gensim.models, which does what I want to do in about a minute, but I want to do what the package does more manually and to have more control over what I am doing.
I have two objects: (1) a dictionary that contains key : values for word embeddings, and (2) my pandas dataframe with my strings that need to be transformed into the index value found for each word in object (1). Consider the code below -- is there any obvious improvement to speed or am I relegated to external packages?
I would have thought that key lookups in a dictionary would be blazing fast.
Object 1
import numpy as np

embeddings_dictionary = dict()
glove_file = open('glove.6B.200d.txt', encoding="utf8")
for line in glove_file:
    records = line.split()
    word = records[0]
    vector_dimensions = np.asarray(records[1:], dtype='float32')
    embeddings_dictionary[word] = vector_dimensions
Object 2 (The slowdown)
no_matches = []
glove_tokenized_data = []
for doc in df['body'][:5]:
    doc = doc.split()
    ints = []
    for word in doc:
        try:
            # the line below is the problem
            idx = list(embeddings_dictionary.keys()).index(word)
        except:
            idx = 400000  # unknown
            no_matches.append(word)
        ints.append(idx)
    glove_tokenized_data.append(ints)
You've got a mapping of word -> np.array. It appears you want a quick way to map word to its location in the key list. You can do that with another dict.
no_matches = []
glove_tokenized_data = []
word_to_index = dict(zip(embeddings_dictionary.keys(), range(len(embeddings_dictionary))))
for doc in df['body'][:5]:
    doc = doc.split()
    ints = []
    for word in doc:
        try:
            idx = word_to_index[word]
        except KeyError:
            idx = 400000  # unknown
            no_matches.append(word)
        ints.append(idx)
    glove_tokenized_data.append(ints)
In the line you marked as a problem, you first create a list from the keys and then look up the word in that list. You do this inside the loop, so the first thing you could do is move this logic to the top of the block (outside the loop) to avoid repeated processing; second, you're now doing all this searching on a list, not a dictionary.
Why not create another dictionary like this on top of the file:
reverse_lookup = {word: index for index, word in enumerate(embeddings_dictionary.keys())}
and then use this dictionary to look up the index of your word. Something like this:
for word in doc:
    if word in reverse_lookup:
        ints.append(reverse_lookup[word])
    else:
        ints.append(400000)  # unknown, to keep positions aligned as in the original code
        no_matches.append(word)
I have a dataframe with a column of pathnames. I can access these paths using:
for i, p in enumerate(df['path']):
I am now, however, looking to extract a value from each of these output files.
The csv file looks like:
# some values
# some values : some values
# some values : some values
# some values : some values
# some string : the value I want
# some string : some values
Is there a way of extracting this value and inserting it into my dataframe?
I believe regex would do the trick. I am just not sure of the exact way. I have some template code which looks like:
if re.match(r"something", p):
    df = pd.read_csv(p)
    df.iloc[i, value_column] = the value I want
Here is a solution to extract the value from the text/csv using the builtin split:
def get_value(string):
    array = string.split(": ")  # maybe without the white space
    return array[0] if len(array) == 1 else array[1]
get_value('some values')
# 'some values'
get_value('some string : the value I want')
# 'the value I want'
Alternatively, using regex
re.sub(r'.*\:\s*(.*)', r'\1', 'some values')
# 'some values'
re.sub(r'.*\:\s*(.*)', r'\1', 'some string : the value I want')
# 'the value I want'
I was helped with this question when it was asked in a clearer context. For a line in a csv file:
m = re.match(r'# some string\s*:\s*([^\n]+)', line)
if m:
    number = m.group(1)
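Putting it together with the dataframe from the question — a sketch, assuming each file contains one line starting with '# some string', that df has a default RangeIndex, and that a 'value' column exists to fill (the column name here is a placeholder):
import re

pattern = re.compile(r'#\s*some string\s*:\s*(.+)')

for i, p in enumerate(df['path']):
    with open(p) as fh:
        for line in fh:
            m = pattern.match(line)
            if m:
                # store the captured value in the row for this path
                df.loc[i, 'value'] = m.group(1).strip()
                break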
I have a very large file that looks like this:
[original file]
Field number 7 (info) contains ~100 pairs of X=Y, separated by ';'.
I first want to split all X=Y pairs.
Next I want to scan one pair at a time, and if X is one of 4 titles and Y is an int, I want to put them in a dictionary.
After finishing going through the pairs I want to check if the dictionary contains all 4 of my titles, and if so, I want to calculate something and write it into a new file.
This is the part of my code which is supposed to do that:
for row in reader:
    m = re.split(';', row[7])  # split the info field by ';'
    d = {}
    nl = []
    # for each info field, split by '='; if it is one of the 4 fields wanted
    # and the value is an int, add it to a dict
    for c in m:
        t = re.split('=', c)
        if (t[0]=='AC_MALE' or t[0]=='AC_FEMALE' or t[0]=='AN_MALE' or t[0]=='AN_FEMALE') and type(t[1])==int:
            d[t[0]] = t[1]
    # if the dict contains all 4 wanted fields, make a new line for the final file
    if 'AC_MALE' in d and 'AC_FEMALE' in d and 'AN_MALE' in d and 'AN_FEMALE' in d:
        total_ac = int(d['AC_MALE']) + int(d['AC_FEMALE'])
        total_an = int(d['AN_MALE']) + int(d['AN_FEMALE'])
        ac_an = total_ac/total_an
        nl.extend([row[0], row[1], row[3], row[4], total_ac, total_an, ac_an])
        writer.writerow(nl)
The code runs with no errors but isn't writing anything to the file.
Can someone figure out why?
Thanks!
type(t[1])==int is never true. t[1] is always a string, because you just split it from another string. It doesn't matter here that the string may contain only digits and could be converted to an int.
Test if you can convert your string to an integer, and if that fails, just move on to the next. If it succeeds, add the value to your dictionary:
for c in m:
    t = re.split('=', c)
    if (t[0]=='AC_MALE' or t[0]=='AC_FEMALE' or t[0]=='AN_MALE' or t[0]=='AN_FEMALE'):
        try:
            d[t[0]] = int(t[1])
        except ValueError:
            # string could not be converted, so move on
            pass
Note that you don't need to use re.split(); use the standard str.split() method instead. You don't need to test if all keys are present in your dictionary afterwards; just test if the dictionary contains 4 elements, i.e. has a length of 4. You can also simplify the code that tests the key name:
for row in reader:
    d = {}
    for key_value in row[7].split(';'):
        key, value = key_value.split('=')
        if key in {'AC_MALE', 'AC_FEMALE', 'AN_MALE', 'AN_FEMALE'}:
            try:
                d[key] = int(value)
            except ValueError:
                pass
    if len(d) == 4:
        total_ac = d['AC_MALE'] + d['AC_FEMALE']
        total_an = d['AN_MALE'] + d['AN_FEMALE']
        ac_an = total_ac / total_an
        writer.writerow([
            row[0], row[1], row[3], row[4],
            total_ac, total_an, ac_an])
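For instance, with a single made-up info field (the values here are invented for illustration), the inner loop behaves like this:
info = "AC_MALE=10;AC_FEMALE=12;AN_MALE=100;AN_FEMALE=110;OTHER=x"
d = {}
for key_value in info.split(';'):
    key, value = key_value.split('=')
    if key in {'AC_MALE', 'AC_FEMALE', 'AN_MALE', 'AN_FEMALE'}:
        try:
            d[key] = int(value)
        except ValueError:
            pass
print(d)  # {'AC_MALE': 10, 'AC_FEMALE': 12, 'AN_MALE': 100, 'AN_FEMALE': 110}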