Pandas modify column value to new pre-defined requirement - python

Below is part of my data. Currently there's a requirement to change old_data to the required format; just use the data below as an example.
df = pd.DataFrame({'old_data': ['12-13A:A', '12-13A:B', '12-13A:C', '12-13A:G',
                                '39-40:A', '39-40:B', '39-40:G',
                                '13A-19:A', '13A-19:B', '13A-19:C', '13A-19:D',
                                '13A-19:E', '13A-19:F', '13A-19:G']})
The pre-defined rule is that the size of each group's range is 2 (like 39-40), 3 (like 12-13A), or 6 (like 13A-19). And if the last digit of a number is 4, we need to change it to the preceding number with an 'A' appended: the number 14 becomes 13A, and 23A means 24. So if the old_data is 33-35:B (the second bed in the range 33, 33A, 35), the required data should be Bed 33A.
I'd appreciate any ideas for how to derive the required_data column from the old_data column with Pandas. Thanks.

Essentially your data is range:alphabet_index.
Here are some helper functions; I will switch between your 'no-four-in-the-last-digit' integer system and the normal integer system.
import re

def to_number_system(s):
    return int(re.sub('3A$', '4', s))

def to_no_four_system(n):
    return 'Bed ' + re.sub('4$', '3A', str(n))
The following function maps your alphabetical indices to the Bed numbers generated by the range, or G to the range itself.
def do_the_job(df):
    _range = df['_range'].iloc[0]
    range_start, range_end = map(to_number_system, _range.split('-'))
    numbers = map(to_no_four_system, range(range_start, range_end + 1))
    return df['index'].map(dict(zip('ABCDEF', numbers), G=_range))
df[['_range', 'index']] = df['old_data'].str.split(':', expand=True)
df['required_data'] = df.groupby('_range').apply(do_the_job).droplevel(0)
Finally, take care of the formatting:
df.drop(columns=['_range', 'index'])
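For the sample data, working the rule through by hand, the final frame should come out like this (a hand-derived sanity check, not captured output):

    old_data required_data
0   12-13A:A        Bed 12
1   12-13A:B        Bed 13
2   12-13A:C       Bed 13A
3   12-13A:G        12-13A
4    39-40:A        Bed 39
5    39-40:B        Bed 40
6    39-40:G         39-40
7   13A-19:A       Bed 13A
8   13A-19:B        Bed 15
9   13A-19:C        Bed 16
10  13A-19:D        Bed 17
11  13A-19:E        Bed 18
12  13A-19:F        Bed 19
13  13A-19:G        13A-19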

Related

Python Replace values in list with dict

I have two variables and am trying to manipulate the data. The first variable is a list with two items:
row = [['Toyyota', 'Cammry', '3000'], ['Foord', 'Muustang', '6000']]
And a dictionary that holds the submissions:
submission = {
    'extracted1_1': 'Toyota', 'extracted1_2': 'Camry', 'extracted1_3': '1000',
    'extracted2_1': 'Ford', 'extracted2_2': 'Mustang', 'extracted2_3': '5000',
    'reportDate': '2022-06-01T08:30', 'reportOwner': 'John Smith'}
extracted1_1 would match up with the first value in the first item of row, extracted1_2 would be the 2nd value in the 1st item, extracted2_1 would be the 1st value in the 2nd item, and so on. I'm trying to update row with the corresponding submission values and having a hard time getting it to work properly.
Here's what I have currently:
iter_bit = iter(submission.values())
for bit in row:
    i = 0
    for bits in bit:
        bit[i] = next(iter_bit)
        i += 1
While this somewhat works, I'm looking for a more efficient approach that loops through the submission rather than the row. Is there an easier or more efficient way, by looping through the submission, to overwrite the corresponding values in row?
Iterate through submission and check whether the key is in the format extractedX_Y. If it is, use X and Y as indices into row and assign the value there.
import re

regex = re.compile(r'^extracted(\d+)_(\d+)$')
for key, value in submission.items():
    m = regex.search(key)
    if m:
        x = int(m.group(1))
        y = int(m.group(2))
        row[x - 1][y - 1] = value
It seems you are trying to convert the portion of the keys after "extracted" into indices into row. To do this, first slice off the portion you don't need (i.e. "extracted"), then split what remains on "_". Then convert each of these strings to an integer and subtract 1, because Python indices are zero-based.
for key, value in submission.items():
    # e.g. key = 'extracted1_1', value = 'Toyota'
    if not key.startswith("extracted"):
        continue
    indices = [int(i) - 1 for i in key[9:].split("_")]
    # e.g. indices = [0, 0]
    # Set the value
    row[indices[0]][indices[1]] = value
Now you have your modified row:
[['Toyota', 'Camry', '1000'], ['Ford', 'Mustang', '5000']]
No clue if it's faster, but it's a two-liner:
for n, val in zip(range(len(row) * 3), submission.values()):
    row[n // 3][n % 3] = val
That said, I would probably do something safer in a work environment, like parsing the key for its indices (the zip trick relies on the extractedX_Y keys coming first, and in order, in the dict).

How to extract the digits from the center of a column of data in a dataframe in pandas?

I want to know how to extract the 3 digits from the center of the values in a dataframe column, where the number of digits can vary.
To do this I used str[1:4], where 1 and 4 are the slice bounds. However, the number of digits in the column being evaluated can change, so I computed the bounds for each row with a mathematical calculation and stored them in columns called 'right' and 'left'. I want to use those per-row numbers in the str slice like this: str[df['right']:df['left']]. However, doing it this way results in null or NaN.
I'd appreciate any advice on how to resolve this.
This is the code:
import pandas as pd
import numpy as np
seed = int(input("Write the seed = 3 digits")) #the seed must be 3 digits
digits = len(str(seed)) #The result will be 3 digits
meansq = pd.DataFrame()
meansq['ID'] = range(100)
meansq['xi'] = np.random.randint(0, 999, size=100)
meansq['xi2'] = pow(meansq['xi'],2)
meansq['length'] = meansq['xi2'].apply(lambda x : len(str(x)))
meansq['right'] = ((meansq['length']-digits)/2).astype(int)
meansq['left'] = (meansq['length']-meansq['right']).astype(int)
meansq['xi_2'] = meansq['xi2'].astype(str)
#meansq['center'] = meansq['xi_2'].str[means['right']:meansq['left']]
meansq['center'] = meansq['xi_2'].str[1:4]
meansq.head()
You can do it with a plain list comprehension:
meansq['center'] = [str(i)[1:4] for i in meansq['xi_2']]
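If you need the per-row bounds from the 'right' and 'left' columns (which is what the str[df['right']:df['left']] attempt was after), here is a minimal sketch, assuming the columns computed above:

# .str slicing can't take per-row bounds, but zipping the string column
# with the two bound columns can.
meansq['center'] = [s[r:l] for s, r, l in zip(meansq['xi_2'], meansq['right'], meansq['left'])]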

Counting the repeated values in one column based on another column

Using Pandas, I am dealing with the following CSV data:
f,f,f,f,f,t,f,f,f,t,f,t,g,f,n,f,f,t,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,t,t,t,nowin
t,f,f,f,f,f,f,f,f,f,t,f,g,f,b,f,f,t,f,f,f,f,f,t,f,t,f,f,f,f,f,f,f,t,f,n,won
t,f,f,f,t,f,f,f,t,f,t,f,g,f,b,f,f,t,f,f,f,t,f,t,f,t,f,f,f,f,f,f,f,t,f,n,won
f,f,f,f,f,f,f,f,f,f,t,f,g,f,b,f,f,t,f,f,f,f,f,t,f,t,f,f,f,f,f,f,f,t,f,n,nowin
t,f,f,f,t,f,f,f,t,f,t,f,g,f,b,f,f,t,f,f,f,t,f,t,f,t,f,f,f,f,f,f,f,t,f,n,won
f,f,f,f,f,f,f,f,f,f,t,f,g,f,b,f,f,t,f,f,f,f,f,t,f,t,f,f,f,f,f,f,f,t,f,n,win
For this part of the raw data, I was trying to return something like:
Column1_name -- t -- count of nowin = 0
Column1_name -- t -- count of won = 3
Column1_name -- f -- count of nowin = 2
Column1_name -- f -- count of win = 1
Based on this idea ("get dataframe row count based on conditions"), I was thinking of doing something like this:
print(df[df.target == 'won'].count())
However, this always returns the same number of 'won' rows, based only on the last column, without taking into consideration whether the first column is an 'f' or a 't'. In other words, I was hoping for something in the Pandas dataframe toolkit that works like SQL's GROUP BY, grouping on, for example, the 1st and last columns.
Should I keep pursuing this idea, or should I simply start using for loops?
If you need, the rest of my code:
import pandas as pd
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/chess/king-rook-vs-king-pawn/kr-vs-kp.data"
df = pd.read_csv(url, names=[
    'bkblk','bknwy','bkon8','bkona','bkspr','bkxbq','bkxcr','bkxwp','blxwp','bxqsq','cntxt','dsopp','dwipd',
    'hdchk','katri','mulch','qxmsq','r2ar8','reskd','reskr','rimmx','rkxwp','rxmsq','simpl','skach','skewr',
    'skrxp','spcop','stlmt','thrsk','wkcti','wkna8','wknck','wkovl','wkpos','wtoeg','target'
])
features = ['bkblk','bknwy','bkon8','bkona','bkspr','bkxbq','bkxcr','bkxwp','blxwp','bxqsq','cntxt','dsopp','dwipd',
            'hdchk','katri','mulch','qxmsq','r2ar8','reskd','reskr','rimmx','rkxwp','rxmsq','simpl','skach','skewr',
            'skrxp','spcop','stlmt','thrsk','wkcti','wkna8','wknck','wkovl','wkpos','wtoeg','target']
# number of lines
#tot_of_records = np.size(my_data,0)
#tot_of_records = np.unique(my_data[:,1])
#for item in my_data:
# item[:,0]
num_of_won = 0
num_of_nowin = 0
for item in df.target:
    if item == 'won':
        num_of_won = num_of_won + 1
    else:
        num_of_nowin = num_of_nowin + 1
print(num_of_won)
print(num_of_nowin)
print(df[df.target == 'won'].count())
#print(df[:1])
#print(df.bkblk.to_string(index=False))
#print(df.target.unique())
#ini_entropy = (() + ())
This could work -
outdf = df.apply(lambda x: pd.crosstab(index=df.target,columns=x).to_dict())
Basically, we go over each feature column and build a crosstab of it against the target column.
Hope this helps! :)
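If you want something closer to the SQL GROUP BY mentioned in the question, here is a minimal sketch for a single feature column (using 'bkblk', the first column from the names list above):

# Count rows per (feature value, target) pair for one column.
counts = df.groupby(['bkblk', 'target']).size()
print(counts)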

How to append data to a dataframe without overwriting?

I'm new to Python, but I need it for a personal project, and so I have this lump of code. Its purpose is to create a table and update it as necessary. The problem is that the table keeps being overwritten and I don't know why. I'm also struggling to correctly assign the starting position of the new lines to append; that's why total (which ends up overwritten as well) and pos are there, but I haven't figured out how to use them correctly. Any tips?
import datetime
import pandas as pd
import numpy as np

total = {}
entryTable = pd.read_csv("Entry_Table.csv")
newEntries = int(input("How many new entries?\n"))
for i in range(newEntries):
    ID = input("ID?\n")
    VQ = int(input("VQ?\n"))
    timeStamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    entryTable.loc[i] = [timeStamp, ID, VQ]
    entryTable.to_csv("Inventory_Table.csv")
    total[i] = 1
pos = sum(total.values())
print(pos)
inventoryTable = pd.read_csv("Inventory_Table.csv", index_col=0)
Your variable i runs from 0 to newEntries - 1. When you add new data to row i in your Pandas dataframe, you are overwriting the existing data in that row. If you want to add new data, try n + i, where n is the initial number of entries. You can determine n with either
n = len(entryTable)
or
n = entryTable.shape[0]
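Putting that together, here is a minimal sketch of the corrected loop (assuming the same column layout as Entry_Table.csv):

import datetime
import pandas as pd

entryTable = pd.read_csv("Entry_Table.csv")
n = len(entryTable)  # initial number of rows

newEntries = int(input("How many new entries?\n"))
for i in range(newEntries):
    ID = input("ID?\n")
    VQ = int(input("VQ?\n"))
    timeStamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    entryTable.loc[n + i] = [timeStamp, ID, VQ]  # new row index, nothing overwritten

entryTable.to_csv("Inventory_Table.csv")  # write once, after the loop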

Stacking Arrays in Numpy: Different behaviors between UNIX and Windows

Note: This is Python 2.7, not Py3
This is an updated attempt at asking an earlier question. You requested my complete code, an explanation of its content, and example output files. I'll try my best to format this well.
This code is meant to take an input file from a fluorometric "plate reader" and convert the readings to DNA concentrations and masses. It then generates an output file organized according to an 8x12 plate scheme (standard for DNA/molecular work). Rows are labeled "A, B, C,...,H" and columns are labeled simply 1 - 12.
Based on user input, arrays need to be stacked to format the output. However, when arrays are stacked in UNIX (and either printed or written to an outfile), they are limited to the first character.
In other words, in Windows, if a number in the array is 247.5, it prints the full number. But in a UNIX environment (Linux/Ubuntu/MacOS), it becomes truncated to simply "2". A number that is -2.7 will print normally in Windows, but in UNIX simply prints as "-".
The complete code can be found below; note that the last chunk is the most relevant portion of the code:
#!/usr/bin/env python
Usage = """
plate_calc.py - version 1.0
Convert a series of plate fluorescence readings
to total DNA mass per sample and print them to
a tab-delimited output file.
This program can take multiple files as inputs
(separated by a space) and generates a new
output file for each input file.
NOTE:
1) Input(s) must be an exported .txt file.
2) Standards must be in columns 1 and 2, or 11
and 12.
3) The program assumes equal volumes across wells.
Usage:
plate_calc.py input.txt input2.txt input3.txt
"""
import sys
import numpy as np

if len(sys.argv) < 2:
    print Usage
else:
    # First, we want to extract the values of interest into a Numpy array
    Filelist = sys.argv[1:]
    input_DNA_vol = raw_input("Volume of sample used for AccuClear reading (uL): ")
    remainder_vol = raw_input("Remaining volume per sample (uL): ")
    orientation = raw_input("Are the standards on the LEFT (col. 1 & 2), or on the RIGHT (col. 11 and 12)? ")
    orientation = orientation.lower()
    for InfileName in Filelist:
        with open(InfileName) as Infile:
            fluor_list = []
            Linenumber = 1
            for line in Infile:  # this will extract the relevant information and store it as a list of lists
                if Linenumber == 5:
                    line = line.strip('\n').strip('\r').strip('\t').split('\t')
                    fluor_list.append(line[1:])
                elif Linenumber > 5 and Linenumber < 13:
                    line = line.strip('\n').strip('\r').strip('\t').split('\t')
                    fluor_list.append(line)
                Linenumber += 1
            fluor_list = [map(float, x) for x in fluor_list]  # converts list items from strings to floats
            fluor_array = np.asarray(fluor_list)  # takes our list of lists and converts it to a numpy array
This portion of the code (above) extracts the values of interest from an input file (obtained from the plate reader) and converts them to an array. It also takes user input to obtain information for calculations and conversions, and to determine the columns in which the standards are placed.
That last part comes into play later, when arrays are stacked - which is where the problematic behavior occurs.
# Create a conditional statement, depending on where the standards are, to split the array
if orientation == "right":
    # Next, we want to average the 11th and 12th values of each of the 8 rows in our numpy array
    stds = fluor_array[:, [10, 11]]  # sub-array with the standard values (last two columns, 8x2)
    data = np.delete(fluor_array, (10, 11), axis=1)  # sub-array with the data (first 10 columns, 8x10)
elif orientation == "left":
    # Next, we want to average the 1st and 2nd values of each of the 8 rows in our numpy array
    stds = fluor_array[:, [0, 1]]  # sub-array with the standard values (first two columns, 8x2)
    data = np.delete(fluor_array, (0, 1), axis=1)  # sub-array with the data (last 10 columns, 8x10)
else:
    print "Error: answer must be 'LEFT' or 'RIGHT'"
std_av = np.mean(stds, axis=1)  # an array of our averaged standard values
# Then, subtract the average value of row 1 (the BLANK) from each of the 8 averages (above)
std_av_st = std_av - std_av[0]
# Run a linear regression of the points in std_av_st against known concentration values
# (these data = y axis; we need an x axis)
x = np.array([0.00, 0.03, 0.10, 0.30, 1.00, 3.00, 10.00, 25.00]) * 10  # ng/uL * 10 = ng/well
xi = np.vstack([x, np.zeros(len(x))]).T  # (x, 0) values, for the regression only; this forces a zero
# intercept (with (x, 1) values the y-intercept is not forced to zero, and the slope is slightly inflated)
m, c = np.linalg.lstsq(xi, std_av_st)[0]  # m = slope for future calculations
# Now subtract the average BLANK value (row 1 of std_av) from all data points in "data"
data_minus_blank = data - std_av[0]
# Divide each number in "data" by the value m derived above (total ng/well for each sample; y/m = x)
ng_per_well = data_minus_blank / m
# Account for the volume of sample put into the AccuClear reading to calculate ng/uL
ng_per_microliter = ng_per_well / float(input_DNA_vol)
# Next, multiply those values by the remaining volume of DNA sample to get total ng
ng_total = ng_per_microliter * float(remainder_vol)
# Set number of decimal places to 1
ng_per_microliter = np.around(ng_per_microliter, decimals=1)
ng_total = np.around(ng_total, decimals=1)
The above code performs the necessary calculations to figure out the concentration (ng/uL) and total mass (ng) of DNA in a given sample based on a linear regression of the DNA "standards," which can either be in columns 1 and 2 (user input = "left") or in columns 11 and 12 (user input = "right").
# Create a row-label array (values A-H) and a filler array ('-') to add to existing arrays
col = [i for i in range(1, 13)]
row = np.asarray(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'])
filler = np.array(['-'] * 16).reshape((8, 2))
The above code creates arrays to be stacked with the original array. The "filler" array is placed based on the user input of "right" or "left" (the stacking command, np.c_[ ], is seen below).
# Create output
Outfile = open('Total_DNA_{0}'.format(InfileName), "w")
Outfile.write("DNA concentration (ng/uL):\n\n")
Outfile.write("\t" + "\t".join([str(n) for n in col]) + "\n")
if orientation == "left":  # add filler to the left, then add row labels to the left of the filler
    ng_per_microliter = np.c_[filler, ng_per_microliter]
    ng_per_microliter = np.c_[row, ng_per_microliter]
    Outfile.write("\n".join(["\t".join([n for n in item]) for item in ng_per_microliter.tolist()]) + "\n\n")
elif orientation == "right":  # add row labels to the left, and filler to the right
    ng_per_microliter = np.c_[row, ng_per_microliter]
    ng_per_microliter = np.c_[ng_per_microliter, filler]
    Outfile.write("\n".join(["\t".join([n for n in item]) for item in ng_per_microliter.tolist()]) + "\n\n")
Outfile.write("Total mass of DNA per sample (ng):\n\n")
Outfile.write("\t" + "\t".join([str(n) for n in col]) + "\n")
if orientation == "left":
    ng_total = np.c_[filler, ng_total]
    ng_total = np.c_[row, ng_total]
    Outfile.write("\n".join(["\t".join([n for n in item]) for item in ng_total.tolist()]))
elif orientation == "right":
    ng_total = np.c_[row, ng_total]
    ng_total = np.c_[ng_total, filler]
    Outfile.write("\n".join(["\t".join([n for n in item]) for item in ng_total.tolist()]))
Outfile.close()  # note: close() must be called, not just referenced
Finally, we have the generation of the output file. This is where the problematic behavior occurs.
Using a simple print command, I found that the stacking command numpy.c_[ ] is the culprit (NOT the array writing command).
So it appears that numpy.c_[ ] does not truncate these numbers in Windows, but will limit those numbers to the first character in a UNIX environment.
What are some alternatives that might work on both platforms? If none exists, I don't mind making a UNIX-specific script.
Thank you all for your help and your patience. Sorry for not providing all of the necessary information earlier.
The images are screenshots showing proper output from Windows and what I end up getting in UNIX (I tried to format these for you...but they were a nightmare). I have also included a screenshot of the output obtained in the terminal when I simply print the arrays "ng_per_microliter" and "ng_total."
Quoting the question:

"Using a simple print command, I found that the stacking command numpy.c_[ ] is the culprit (NOT the array writing command). So it appears that numpy.c_[ ] does not truncate these numbers in Windows, but will limit those numbers to the first character in a UNIX environment."

Let's illustrate these statements with simple examples; np.c_[] should not be doing anything different across platforms.
In Py3, where the default string type is unicode, and with numpy 1.12:
In [149]: col = [i for i in range(1,13)]
...: row = np.asarray(['A','B','C','D','E','F','G','H'])
...: filler = np.array(['-','-','-','-','-','-','-','-','-','-','-','-','-','-','-','-',]).reshape((8,2))
...:
In [150]: col
Out[150]: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
In [151]: "\t"+"\t".join([str(n) for n in col])+"\n"
Out[151]: '\t1\t2\t3\t4\t5\t6\t7\t8\t9\t10\t11\t12\n'
In [152]: filler
Out[152]:
array([['-', '-'],
...
['-', '-'],
['-', '-']],
dtype='<U1')
In [153]: row
Out[153]:
array(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'],
dtype='<U1')
In [154]: row.shape
Out[154]: (8,)
In [155]: filler.shape
Out[155]: (8, 2)
In [159]: ng_per_microliter=np.arange(8.)+1.23
In [160]: np.c_[filler,ng_per_microliter]
Out[160]:
array([['-', '-', '1.23'],
['-', '-', '2.23'],
['-', '-', '3.23'],
...
['-', '-', '8.23']],
dtype='<U32')
In [161]: np.c_[row,ng_per_microliter]
Out[161]:
array([['A', '1.23'],
['B', '2.23'],
['C', '3.23'],
....
['H', '8.23']],
dtype='<U32')
It is possible that with earlier numpy versions, a concatenation of the U1 (or S1 in Py2) array with numeric values leaves the dtype at U1. In my example they've been expanded to U32.
So if you suspect np.c_, display the result of those calls (with repr if needed)
print(repr(np.c_[row,ng_per_microliter]))
and track the dtype.
From the v1.12 release notes (possibly earlier):
The astype method now returns an error if the string dtype to cast to is not long enough in “safe” casting mode to hold the max value of integer/float array that is being casted. Previously the casting was allowed even if the result was truncated.
This might come into play when doing concatenate.
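To see what the symptom looks like when a float array is forced into a 1-character string dtype, here is a small illustrative sketch (the exact casting behavior varies by numpy version, so treat the truncating line as an assumption about older releases):

import numpy as np

vals = np.array([247.5, -2.7])
print(vals.astype('U32'))  # ['247.5' '-2.7'] -- dtype wide enough, values intact
print(vals.astype('U1'))   # ['2' '-']        -- truncated to one character,
                           # the same symptom reported in the question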
With the help of user hpaulj, I've figured out that this isn't an issue with different behavior between operating systems and environments. It's more than likely due to users having different versions of numpy.
Concatenating the arrays automatically converted the 'float64' dtype to 'S1' (to match the "filler" array ('-') and the "row" array ('A', 'B', etc.)).
Newer versions of numpy - specifically, v 1.12.X - seem to allow the concatenation of arrays without this automatic conversion.
I'm still not sure of a way around this issue in older versions of numpy, but it should be a simple matter to advise folks to upgrade their version to get correct output. :)
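One possible workaround for older versions, sketched here as an untested idea: format the numbers as strings yourself before stacking, so np.c_[] only ever combines string arrays and no float-to-U1 cast is involved.

# Hypothetical sketch: pre-format the floats, then stack string arrays only.
ng_strings = np.array(['%.1f' % v for v in ng_per_microliter.ravel()]).reshape(ng_per_microliter.shape)
ng_per_microliter = np.c_[row, ng_strings]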
