I have two variables whose data I'm trying to manipulate. The first is a list that has two items:
row = [['Toyyota', 'Cammry', '3000'], ['Foord', 'Muustang', '6000']]
And a dictionary of submissions:
submission = {
    'extracted1_1': 'Toyota', 'extracted1_2': 'Camry', 'extracted1_3': '1000',
    'extracted2_1': 'Ford', 'extracted2_2': 'Mustang', 'extracted2_3': '5000',
    'reportDate': '2022-06-01T08:30', 'reportOwner': 'John Smith'}
extracted1_1 would match up with the first value in the first item from row, extracted1_2 would be the 2nd value in the 1st item, extracted2_1 would be the 1st value in the 2nd item, and so on. I'm trying to update row with the corresponding submission values and having a hard time getting it to work properly.
Here's what I have currently:
iter_bit = iter(submission.values())
for bit in row:
    i = 0
    for bits in bit:
        bit[i] = next(iter_bit)
        i += 1
While this somewhat works, I'm looking for a more efficient way to do this by looping through the submission rather than the row. Is there an easier or more efficient way to loop through the submission and overwrite the corresponding value in row?
Iterate through submission and check whether each key matches the format extractedX_Y. If it does, use X and Y as indexes into row and assign the value there.
import re

regex = re.compile(r'^extracted(\d+)_(\d+)$')
for key, value in submission.items():
    m = regex.search(key)
    if m:
        x = int(m.group(1))
        y = int(m.group(2))
        row[x-1][y-1] = value
It seems you are trying to convert the portion of the keys after "extracted" into indices into row. To do this, first slice off the portion you don't need (i.e. "extracted", which is 9 characters), then split what remains on "_". Then convert each of these strings to an integer and subtract 1, because Python indices are zero-based.
for key, value in submission.items():
    # e.g. key = 'extracted1_1', value = 'Toyota'
    if not key.startswith("extracted"):
        continue
    indices = [int(i) - 1 for i in key[9:].split("_")]
    # e.g. indices = [0, 0]
    # Set the value
    row[indices[0]][indices[1]] = value
Now you have your modified row:
[['Toyota', 'Camry', '1000'], ['Ford', 'Mustang', '5000']]
No clue if it's faster, but it's a two-liner haha:
for n, val in zip(range(len(row) * 3), submission.values()):
    row[n//3][n%3] = val
That said, I would probably do something safer in a work environment, like parsing the key for its index.
I have a dataframe that I created by hand. I am working on code that copies the dataframe and concatenates the new dataframe to the end of the first one. For now, the code needs to look through each value of the 'Name' column, and if a value contains a number, increase that number by 1. I need the number converted to an int so that I can write a function that looks through the dataframe and automatically adds 1 to the largest number found. An example:
import re
import pandas as pd

data = {'ID': [1,2,3,4],
        'Name': ['BN #1', 'HHC', 'A comp', 'B Comp']}
df = pd.DataFrame(data)
df['SysNum'] = [int(re.search(r'(?<=#)\d', x)[0]) for x in df['Name'].values]
Afterwards, the new df should look like
data2 = {'ID': [1,2,3,4,5,6,7,8],
         'Name': ['BN #1', 'HHC', 'A comp', 'B Comp', 'BN #2', 'HHC', 'A comp', 'B Comp']}
When I run this, I receive a 'NoneType' object is not subscriptable error. This makes sense because only the 'BN #' row has a number, and re.search returns None when the pattern does not match, but I cannot figure out how to tell Python to ignore the other rows.
EDIT
Only the first row of each dataframe will increase by 1, so if there is an easier way that doesn't use re.search, that is fine. I know there are a couple of ways of doing this, but I want to be able to always look through the string value of BN and increase it by 1 every time I run the code.
REGEX EDIT
df2['BaseName'] = [re.sub(r'\d', '', x) for x in df2['Name'].values]
df['BaseName'] = [re.sub(r'\d', '', x) for x in df['Name'].values]
df2['SysNum'] = [int(re.search(r'(?<=#)\d', x)[0]) for x in df2['Name'].values]
# df2['SysNum'] = df2['Name'].get(r'(?<=#)\d').astype(int)
# df['SysNum'] = [int(re.search(r'(?<=#)\d', x)[0]) for x in df['Name'].values]
df['SysNum'] = df['Name'].str.contains(r'(?<=#)\d').astype(int)

m = re.search(r'(?<=#)\d', df2['Name'].iloc[0])
if m:
    df2['SysNum'] = int(m.group(0)) + 1
n = re.search(r'(?<=#)\d', df['Name'].iloc[0])
if n:
    df['SysNum'] = int(n.group(0)) + 1
import numpy as np

new_names = df2['BaseName'].unique()
maxes2 = np.zeros((len(new_names), ))
for j in range(len(new_names)):
    un2 = new_names[j]
    maxes2[j] = df['SysNum'].loc[df['BaseName'] == un2].max()
    df2['SysNum'].loc[df2['BaseName'] == un2] = np.linspace(1, len(df2['SysNum'].loc[df2['BaseName'] == un2]),
                                                            len(df2['SysNum'].loc[df2['BaseName'] == un2]))
    df2['SysNum'].loc[df2['BaseName'] == un2] += maxes2[j]
    newnames2 = [s + '%d' % num for s, num in zip(df2['BaseName'].loc[df2['BaseName'] == un2].values,
                                                  df2['SysNum'].loc[df2['BaseName'] == un2].values)]
    df2['Name'].loc[df2['BaseName'] == un2] = newnames2
I have this code working for two dataframes, and the numbering works out how I would like it to. Those first two use a "Name-###" naming convention for all rows, which lets the commented-out re.search lines at the top run just fine. The next two dataframes I am working on are like the examples above: only 'BN #1' has a number, and the rest of the names do not. When I run the commented-out re.search lines, the code tries to convert the NoneTypes to int and fails. When I run the code as is now, a new number is put on every row immediately following the name, but I only need a new number on the row with the '#'. So what I am struggling with is code that looks through the dataframe for a '#' sign, turns the number after it into an int, finds the max of those ints and adds 1, puts that new number into the new dataframe, and appends the new dataframe onto the old one to build a larger master list.
You can access the value on the first row of the Name column using df['Name'].iloc[0].
Thus, you can search for a sequence of digits after a # sign in that value using
m = re.search(r'#(\d+)', df['Name'].iloc[0])
if m:
    df['SysNum'] = int(m.group(1)) + 1
Output:
>>> df
   ID    Name  SysNum
0   1   BN #1       2
1   2     HHC       2
2   3  A comp       2
3   4  B Comp       2
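If you also want the copy-and-increment step the question describes, here is a minimal sketch (not part of the answer above; it assumes the counter always follows a '#' in the Name column and that df is the frame built earlier):

import re
import pandas as pd

def next_copy(df):
    # Collect every number that appears after a '#' anywhere in the Name column
    nums = [int(m.group(1)) for x in df['Name']
            for m in [re.search(r'#(\d+)', x)] if m]
    new = df.copy()
    # Bump only the names that actually contain a '#<number>'
    new['Name'] = new['Name'].str.replace(r'#\d+', f'#{max(nums) + 1}', regex=True)
    # Append the copy onto the original to build the master list
    return pd.concat([df, new], ignore_index=True)

df = next_copy(df)  # Name column now matches data2 above; IDs are left duplicated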
I'm importing a CSV to a dictionary, where there are a number of houses labelled (i.e. 1A, 1B, ...).
Rows are labelled with items such as 'coffee' etc. The data in the table indicates how much of each item each household needs.
Excel screenshot
What I am trying to do is check the values of the key-value pairs in the dictionary for anything that isn't blank (containing either 1 or 2), and then take the key-value pair and the 'PRODUCT NUMBER' (from the csv) and append those to a new list.
I want to create a shopping list that will contain what item I need, with what quantity, to which household.
The column containing 'week' is not important for this.
I import the CSV into python as a dictionary like this:
import csv
import pprint
from typing import List, Dict

input_file_1 = csv.DictReader(open("DATA CWK SHOPPING DATA WEEK 1 FILE B.xlsb.csv"))
table: List[Dict[str, int]] = []  # list
for row in input_file_1:
    string_row: Dict[str, int] = {}  # dictionary
    for column in row:
        string_row[column] = row[column]
    table.append(string_row)
I found on GeeksforGeeks how to access a pair by its value. However, when I try this on my dictionary, it only seems to be able to search the last row.
# creating a new dictionary
my_dict = {"java": 100, "python": 112, "c": 11}

# list out keys and values separately
key_list = list(my_dict.keys())
val_list = list(my_dict.values())

# print key with val 100
position = val_list.index(100)
print(key_list[position])
I also tried to do a for in range loop, but that didn't seem to work either:
for row in table:
    if row["PRODUCT NUMBER"] == '1' and row["Week"] == '1':
        for i in range(8):
            if string_row.values() != ' ':
                print(row[i])
If I am unclear anywhere, please let me know and I will clear it up!
Here is a loop I made that should do what you want.
values = list(table.values())
keys = list(table.keys())
new_table = {}
index = -1
for i in range(values.count("")):
    # Find each successive blank value and copy its key/value pair across
    index = values.index("", index + 1)
    new_table[keys[index]] = values[index]
If you want to remove those values from the original dict, you can just add d.pop(keys[index]) inside the loop.
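Beyond filtering blanks, the shopping list itself could be built with a sketch like the following, looping over the table built above. The column names 'PRODUCT NUMBER' and 'Week' are assumptions taken from the question; every other column is treated as a house:

shopping_list = []
for row in table:
    for house, qty in row.items():
        # Skip the assumed non-house columns and any blank cells
        if house in ('PRODUCT NUMBER', 'Week') or str(qty).strip() == '':
            continue
        # (product, household, quantity)
        shopping_list.append((row['PRODUCT NUMBER'], house, qty))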
I have the following xlsx file that I need to work on:
I want to iterate through the dataframe and, if the ITEM CODE column contains a dictionary key, check on the same row whether TG contains the dictionary value[0] (first position in the tuple); if it does, I want to insert the dictionary value[1] (second position in the tuple) into another column named SKU.
Dataframe: #df3 = df2.append(df1)
catp = {"2755":(('24','002'),('25','003'),('26','003'),('27','004'),('28','005'),('29','006'),('30','007'),('31','008'),
('32','009'),('32','010'),('33','011'),('34','012'),('35','013'),('36','014')),
"2513":(('38','002'),('40','003'),('42','004'),('44','005'),('46','006'),('48','007'),('50','008'),('52','009'),
('54','010'))}
for i, row in df3.iterrows():
    if catp.key() in df3['ITEM CODE'][i] and catp.value()[0] in df3['TG'][i]:
        codmarime = catp.value()[1]
        df3['SKU'][i] = '20'+df3['ITEM CODE'][i]+[i]+codmarime
    else:
        df3['SKU'][i] = '20'+df3['ITEM CODE'][i]+'???'
If 2755 and 24 found SKU = '202755638002'
If 2513 and 44 found SKU = '202513123005'
Output xlsx
As you didn't provide text data to create at least a fragment of your DataFrame, I copied 3 rows from your picture, creating my test DataFrame:
import pandas as pd

df3 = pd.DataFrame(data=[
    ['1513452', 'AVRO D2', '685', 'BLACK/BLACK/ANTRACITE', '24', 929.95, '8052644627565'],
    ['2513452', 'AVRO D2', '685', 'BLACK/BLACK/ANTRACITE', '21', 929.95, '8052644627565'],
    ['2755126', 'AMELIA', 'Y17', 'DARK-DENIM', '24', 179.95, '8052644627565']],
    columns=['ITEM CODE', 'ITEM', 'COLOR', 'COLOR CODE', 'TG', 'PRICE', 'EAN'])
Details:
The first row does not contain any of the catp keys in its ITEM CODE column.
The second row: ITEM CODE contains one of your codes (2513), but no tuple saved under the 2513 key has a first element == '21' for the TG column.
The third row: ITEM CODE contains one of your codes (2755), TG == '24', and among the tuples saved under 2755 there is one whose first element == '24'.
Then we have to define a couple of auxiliary functions:
def findContainedCodeAndVal(dct, s):
    # Look for a key of dct that occurs as a substring of s
    for eachKey in dct.keys():
        if s.find(eachKey) >= 0:
            return (eachKey, dct[eachKey])
    return (None, None)
This function attempts to find in dct a key contained in s.
It returns a 2-tuple of the key found and its associated value from dct (or (None, None) if no key matches).
def find2ndElem(tuples, s):
    # Return the second element of the tuple whose first element equals s
    for tpl in tuples:
        if tpl[0] == s:
            return tpl[1]
    return ''
This function checks each tuple from tuples for a first element == s and returns the second element of that tuple (or '' if none matches).
And the last function to define is one to be applied to each row of your DataFrame. It returns the value to be saved in the SKU column:
def fn(row):
    ind = row.name  # Read row index
    iCode = row['ITEM CODE']
    k, val = findContainedCodeAndVal(catp, iCode)
    codmarime = ''
    if k:
        tg = row.TG
        codmarime = find2ndElem(val, tg)
    if codmarime == '':
        codmarime = '???'
    return f'20/{iCode}/{ind}/{codmarime}'
Note that it uses your catp dictionary.
For demonstration purposes, I introduced additional slashes into the returned value, separating adjacent parts. Remove them in the target version.
And the last thing to do is to compute the SKU column of your DataFrame, applying the fn function to each row of df3 and saving the result under the SKU column:
df3['SKU'] = df3.apply(fn, axis=1)
When you print the DataFrame (containing my test data), the SKU column will contain:
20/1513452/0/???
20/2513452/1/???
20/2755126/2/002
I am unable to understand the question properly, but just correcting the errors I see in your code:
if catp.key() in df3['ITEM CODE'][i] and catp.value()[0] in df3['TG'][i]:
This is incorrect: dictionaries have .keys(), .values() and .items() methods, not .key() or .value().
I am taking a different approach that should work, if I understand the end goal:
for key, pairs in catp.items():
    # Rows whose ITEM CODE contains this key and whose TG equals the first element of one of its tuples
    mask = (df3['ITEM CODE'].astype(str).str.contains(key)
            & df3['TG'].astype(str).isin([t[0] for t in pairs]))
    for i, row in df3.loc[mask].iterrows():
        codmarime = dict(pairs)[str(row['TG'])]
        df3.at[i, 'SKU'] = '20' + str(row['ITEM CODE']) + codmarime
I have a list of dictionaries read in from csv DictReader that represent rows of a csv file:
rows = [{"id":"123","date":"1/1/18","foo":"bar"},
{"id":"123","date":"2/2/18", "foo":"baz"}]
I would like to create a new dictionary where only unique IDs are stored, keeping only the row entry with the most recent date. Based on the above example, it would keep the row with date 2/2/18.
I was thinking of doing something like this, but am having trouble translating the pseudocode in the else statement into actual Python.
I can figure out the part that checks which of two dates is more recent, but I'm having the most trouble figuring out how to check the new list for the dictionary that contains the same id and then retrieve the date from that row.
Note: Unfortunately, due to resource constraints on our platform I am unable to use pandas for this project.
new_data = []
for row in rows:
    if row['id'] not in new_data:
        new_data.append(row)
    else:
        check the element in new_data with the same id as row['id']
        if that element's date value is less recent:
            replace it with the current row
        else:
            continue to next row in rows
You'll need a function to convert your date (as string) to a date (as date).
import datetime

def to_date(date_str):
    d1, m1, y1 = [int(s) for s in date_str.split('/')]
    return datetime.date(y1, m1, d1)
I assumed your date format is d/m/yy. Consider using datetime.strptime to parse your dates, as illustrated by Alex Hall's answer.
Then, the idea is to loop over your rows and store them in a new structure (here, a dict whose keys are the IDs). If a key already exists, compare its date with the current row, and take the right one. Following your pseudo-code, this leads to:
rows = [{"id":"123","date":"1/1/18","foo":"bar"},
{"id":"123","date":"2/2/18", "foo":"baz"}]
new_data = dict()
for row in rows:
existing = new_data.get(row['id'], None)
if existing is None or to_date(existing['date']) < to_date(row['date']):
new_data[row['id']] = row
If you want your new_data variable to be a list, use new_data = list(new_data.values()).
import datetime

rows = [{"id":"123","date":"1/1/18","foo":"bar"},
        {"id":"123","date":"2/2/18", "foo":"baz"}]

def parse_date(d):
    return datetime.datetime.strptime(d, "%d/%m/%y").date()

tmp_dict = {}
for row in rows:
    if row['id'] not in tmp_dict:
        tmp_dict[row['id']] = row
    else:
        if parse_date(row['date']) > parse_date(tmp_dict[row['id']]['date']):
            tmp_dict[row['id']] = row

print(list(tmp_dict.values()))
Output:
[{'date': '2/2/18', 'foo': 'baz', 'id': '123'}]
Note: you can merge the two ifs into if row['id'] not in tmp_dict or parse_date(row['date']) > parse_date(tmp_dict[row['id']]['date']) for cleaner and shorter code.
Firstly, work with proper date objects, not strings. Here is how to parse them:
from datetime import datetime, date

rows = [{"id": "123", "date": "1/1/18", "foo": "bar"},
        {"id": "123", "date": "2/2/18", "foo": "baz"}]

for row in rows:
    row['date'] = datetime.strptime(row['date'], '%d/%m/%y').date()
(check if the format is correct)
Then for the actual task:
new_data = {}
for row in rows:
    # Keep whichever row (existing or current) has the later date
    new_data[row['id']] = max(new_data.get(row['id'], row), row,
                              key=lambda r: r['date'])
print(new_data.values())
Alternatively:
Here are some generic utility functions that work well here which I use in many places:
from collections import defaultdict

def group_by_key_func(iterable, key_func):
    """
    Create a dictionary from an iterable such that the keys are the result of evaluating a key function on elements
    of the iterable and the values are lists of elements all of which correspond to the key.
    """
    result = defaultdict(list)
    for item in iterable:
        result[key_func(item)].append(item)
    return result

def group_by_key(iterable, key):
    return group_by_key_func(iterable, lambda x: x[key])
Then the solution can be written as:
by_id = group_by_key(rows, 'id')
for id_num, group in list(by_id.items()):
    by_id[id_num] = max(group, key=lambda r: r['date'])
print(by_id.values())
This is less efficient than the first solution because it creates lists along the way that are discarded, but I use the general principles in many places and I thought of it first, so here it is.
If you like to utilize classes as much as I do, then you could make your own class to do this:
from datetime import date

rows = [
    {"id":"123","date":"1/1/18","foo":"bar"},
    {"id":"123","date":"2/2/18", "foo":"baz"},
    {"id":"456","date":"3/3/18","foo":"bar"},
    {"id":"456","date":"1/1/18","foo":"bar"}
]

class unique(dict):
    def __setitem__(self, key, value):
        # Add key if missing or replace key if date is newer
        if key not in self or self[key]["date"] < value["date"]:
            dict.__setitem__(self, key, value)

data = unique()  # Initialize new class based on dict
for row in rows:
    d, m, y = map(int, row["date"].split('/'))  # Split date into parts
    row["date"] = date(y, m, d)  # Replace date value
    data[row["id"]] = row  # Set new data. Will overwrite same ids with more recent

print(list(data.values()))
Outputs:
[
{'date': datetime.date(18, 2, 2), 'foo': 'baz', 'id': '123'},
{'date': datetime.date(18, 3, 3), 'foo': 'bar', 'id': '456'}
]
Keep in mind that data is essentially a dict that overrides the __setitem__ method, using IDs as keys. And the dates are date objects, so they can be compared easily.
I have a list of tuples in which I am trying to group the similar items together.
E.g.
[('/Desktop/material_design_segment/arc_01.texture', 'freshnel_intensity_3.0022.jpg'),
('/Desktop/material_design_segment/arc_01.texture', 'freshnel_intensity_4.0009.jpg'),
('/Desktop/material_design_segment/arc_08.texture', 'freshnel_intensity_8.0020.jpg'),
('/Desktop/material_design_segment/arc_05.texture', 'freshnel_intensity_5.0009.jpg'),
('/Desktop/material_design_filters/custom/phase_03.texture', 'rounded_viscosity.0002.jpg'),
('/Desktop/material_design_filters/custom/phase_03.texture', 'freshnel_intensity_9.0019.jpg')]
My results should look like:
{'/Desktop/material_design_segment/arc_01.texture': ['freshnel_intensity_3.0022.jpg',
                                                     'freshnel_intensity_4.0009.jpg'],
 '/Desktop/material_design_segment/arc_08.texture': ['freshnel_intensity_8.0020.jpg'],
 '/Desktop/material_design_segment/arc_05.texture': ['freshnel_intensity_5.0009.jpg'],
 '/Desktop/material_design_filters/custom/phase_03.texture': ['rounded_viscosity.0002.jpg',
                                                              'freshnel_intensity_9.0019.jpg']}
However, when I tried the following code, it only returns one item per key.
from collections import defaultdict
from pprint import pprint

groups = defaultdict(str)
for date, value in aaa:
    groups[date] = value
pprint(groups)
This is the output:
{'/Desktop/material_design_segment/arc_01.texture': 'freshnel_intensity_4.0009.jpg',
 '/Desktop/material_design_filters/custom/phase_03.texture': 'freshnel_intensity_9.0019.jpg',
 '/Desktop/material_design_segment/arc_08.texture': 'freshnel_intensity_8.0020.jpg',
 '/Desktop/material_design_segment/arc_05.texture': 'freshnel_intensity_5.0009.jpg'}
Where am I going wrong?
You're assigning value to groups[date], which overwrites the previous value. You need to append it to a list.
groups = defaultdict(list)
for date, value in aaa:
    groups[date].append(value)
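As a usage note (an assumption about how you'd display it, not part of the answer itself): convert the defaultdict to a plain dict for printing, and each texture path should map to a list of its jpg files, matching the expected output above.

from pprint import pprint

pprint(dict(groups))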
You should append the values into a list as follows (based on your code):
groups = defaultdict(list)
for date, value in aaa:
    groups[date].append(value)
print(groups)