How to count characters across strings in a pandas column - python

I have a dataframe with the following structure:
prod_sec
A
AA
AAAAAAAAAAB
AAAABCCCAA
AACC
ABCCCBAC
df = pd.DataFrame({'prod_sec': ['A','AA','AAAAAAAAAAB','AAAABCCCAA','AACC','ABCCCBAC']})
Each string is a sequence made up of letters (A to C in this example).
I would like to create a list for each letter that counts the occurrences in each position down the entire pandas column.
For example, in the first string, A is only in the first position/index and not in the other locations.
In the second string, A is in the first two positions and not in the other locations.
In the third string, A is in every position except the last one. Etc... I want a total count, for the column, by position. Here is an example for A:
A -> [1,0,0,0,0,0,0,0,0,0,0]
AA -> [1,1,0,0,0,0,0,0,0,0,0]
AAAAAAAAAAB -> [1,1,1,1,1,1,1,1,1,1,0]
AAAABCCCAA -> [1,1,1,1,0,0,0,0,1,1,0]
AACC -> [1,1,0,0,0,0,0,0,0,0,0]
ABCCCBAC -> [1,0,0,0,0,0,1,0,0,0,0]
So for A, I would want an output similar to the following: A -> [6,4,2,2,1,1,2,1,2,2,0]
In the end, I'm trying to get a matrix with a row for each character.
[6,4,2,2,1,1,2,1,2,2,0]
[0,1,0,0,1,1,0,0,0,0,1]
[0,0,2,2,1,1,1,2,0,0,0]

The following should work. You can adjust the result, depending on your exact needs (numpy array, data frame, dictionary, etc). Tell me if you need more help with that.
max_length = max([len(i) for i in df.prod_sec])
d = {'A': [0] * max_length, 'B': [0] * max_length, 'C': [0] * max_length}
for i in df.prod_sec:
    for k in range(len(i)):
        d[i[k]][k] += 1
result = pd.DataFrame.from_dict(d, orient='index')
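For reference, here is a self-contained sketch of that approach run on the example data from the question (minor renamings only); the printed matrix is what the loops produce, with one row per letter and one column per position:
import pandas as pd

df = pd.DataFrame({'prod_sec': ['A', 'AA', 'AAAAAAAAAAB', 'AAAABCCCAA', 'AACC', 'ABCCCBAC']})
max_length = max(len(s) for s in df.prod_sec)
d = {'A': [0] * max_length, 'B': [0] * max_length, 'C': [0] * max_length}
for s in df.prod_sec:
    for k, ch in enumerate(s):
        d[ch][k] += 1  # count letter ch at position k
result = pd.DataFrame.from_dict(d, orient='index')
print(result)
#    0  1  2  3  4  5  6  7  8  9  10
# A  6  4  2  2  1  1  2  1  2  2   0
# B  0  1  0  0  1  1  0  0  0  0   1
# C  0  0  2  2  1  1  1  2  0  0   0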

Related

selecting subsets of data in Pandas

I have a data set containing 5 rows × 1317 columns; attached you can see what the data set looks like. The header contains numbers which are wavelengths, but I only want to select the columns within a specific range of wavelengths.
The wavelength values I am interested in are stored in an array (c) of size 1 × 235.
How can I extract the desired columns according to the wavelength values stored in c?
If your array c only contains values that are also column headings (that is, c doesn't have any additional values), you may be able to just convert it to a list l and use df[l].
For example, with what is shown in your current picture, you could do:
l = [102,105] # I am assuming that your column headings in the df are integers, not strings
df[l]
This will display those two columns. If you want them in a new dataframe, do something like df2 = pandas.DataFrame(df[l]). If l had 5 columns, it would show those 5 columns. So if you pass in your array c (or make it into a list, probably with l = list(c)), you'll get your columns.
If your array has additional values that aren't necessarily columns in your dataframe, you'll need to make a sub-list of just those columns.
sub_c = list()  # create a blank list that we will add to
c_list = list(c)
for column in df.columns:
    if column in c_list:
        sub_c.append(column)
df[sub_c]
This builds a sublist containing only the values that are actually column headers, so you won't be trying to view columns that don't exist.
Keep in mind that you'll need matching data-types between your c array and your column headers.
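As an alternative sketch (again assuming matching data-types between c and the column labels), pandas can compute the overlap for you with Index.intersection, which avoids the explicit loop:
# keep only the columns that also appear in c
cols = df.columns.intersection(list(c))
subset = df[cols]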

Remove last 2 digits if first 4 equal X or Y

I have a CSV column, in string format, where each element has between 4 and 6 digits. If the first 4 digits equal 3372 or 2277, I want to drop the last 2 digits of that element so that only 3372 or 2277 remains. I don't want to alter the other elements.
I'm guessing some loops, if statements and mapping maybe?
How would I go about this? (Please be kind. By down-rating people's posts you discourage them from learning. If you want to help, take time to read the post; it isn't difficult to understand.)
Rather than using loops, and if your csv file is rather big, I suggest you use pandas DataFrames:
import pandas as pd

# Read your file; the csv will be read into a DataFrame (which is a matrix)
df = pd.read_csv('your_file.csv')

# Define a function to apply to each element in your DataFrame:
def update_df(x):
    if x.startswith('3372'):
        return '3372'
    elif x.startswith('2277'):
        return '2277'
    else:
        return x

# Use applymap, which applies a function to each element of your DataFrame, and collect the result in df1:
df1 = df.applymap(update_df)
print(df1)
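Note that applymap runs update_df on every cell, which assumes every column holds strings. If only one column needs fixing, a small sketch restricted to that column avoids touching the rest ('code' is a hypothetical column name):
# apply the same function to a single column only; 'code' is a placeholder name
df['code'] = df['code'].apply(update_df)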
Conversely, if you have a small dataset you can use loops, as suggested in the other answer.
Since your values are still strings, I would use slicing to look at the first 4 chars. If they match, we'll chop the end off the string. Otherwise, we'll return the value unaltered.
Here's a function that should do what you want:
def fix_digits(val):
    if val[:4] in ('3372', '2277'):
        return val[:4]
    return val

# Here you'll need the actual code to read your CSV file
for row in csv_file:
    # Assuming your value is in the 6th column
    row[5] = fix_digits(row[5])
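For a minimal end-to-end sketch with the standard csv module (the file names are placeholders, and the column position follows the assumption in the comment above):
import csv

with open('input.csv', newline='') as f_in, open('output.csv', 'w', newline='') as f_out:
    reader = csv.reader(f_in)
    writer = csv.writer(f_out)
    for row in reader:
        row[5] = fix_digits(row[5])  # assuming the value is in the 6th column
        writer.writerow(row)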

Numpy where matching two specific columns

I have a six column matrix. I want to find the row(s) where BOTH columns match the query.
I've been trying to use numpy.where, but I can't specify it to match just two columns.
#Example of the array
x = np.array([[860259, 860328, 861277, 861393, 865534, 865716], [860259, 860328, 861301, 861393, 865534, 865716], [860259, 860328, 861301, 861393, 871151, 871173],])
print(x)
#Match first column of interest
A = np.where(x[:,2] == 861301)
#Match second column on interest
B = np.where(x[:,3] == 861393)
#rows in both A and B
np.intersect1d(A, B)
#This approach works, but is not column specific for the intersect, leaving me with extra rows I don't want.
#This is the only way I can get Numpy to match the two columns, but
#when I query I will not actually know the values of columns 0,1,4,5.
#So this approach will not work.
#Specify what row should exactly look like
np.where(all([860259, 860328, 861277, 861393, 865534, 865716]))
#I want something like this:
#Where * could be any number. But I think that this approach may be
#inefficient. It would be best to just match column 2 and 3.
np.where(all([*, *, 861277, 861393, *, *]))
I'm looking for an efficient answer, because I am looking through a 150GB HDF5 file.
Thanks for your help!
If I understand you correctly, you can use slightly more advanced slicing, like this:
np.where(np.all(x[:,2:4] == [861277, 861393], axis=1))
This will give you only the rows where those two columns equal [861277, 861393].
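A quick sketch on the example array, using the values the question actually queried (861301 in column 2 and 861393 in column 3), returns the indices of the matching rows:
import numpy as np

x = np.array([[860259, 860328, 861277, 861393, 865534, 865716],
              [860259, 860328, 861301, 861393, 865534, 865716],
              [860259, 860328, 861301, 861393, 871151, 871173]])

# rows where column 2 == 861301 AND column 3 == 861393
rows = np.where(np.all(x[:, 2:4] == [861301, 861393], axis=1))[0]
print(rows)     # [1 2]
print(x[rows])  # the matching rows themselves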

Efficient way of converting a numpy array of 2 dimensions into a list with no duplicates

I want to extract the values from two different columns of a pandas dataframe, put them in a list with no duplicate values.
I have tried the following:
arr = df[['column1', 'column2']].values
thelist = []
for ix, iy in np.ndindex(arr.shape):
    if arr[ix, iy] not in thelist:
        thelist.append(arr[ix, iy])
This works but it is taking too long. The dataframe contains around 30 million rows.
Example:
column1 column2
1 adr1 adr2
2 adr1 adr2
3 adr3 adr4
4 adr4 adr5
Should generate the list with the values:
[adr1, adr2, adr3, adr4, adr5]
Can you please help me find a more efficient way of doing this, considering that the dataframe contains 30 million rows?
@ALollz gave the right answer. I'll extend from there. To convert into a list as expected, just use list(np.unique(df.values)).
You can use just np.unique(df) (maybe this is the shortest version).
Formally, the first parameter of np.unique should be an array_like object,
but as I checked, you can also pass just a DataFrame.
Of course, if you want a plain list rather than an ndarray, write
np.unique(df).tolist().
Edit following your comment
If you want the list unique but in the order of appearance, write:
pd.DataFrame(df.values.reshape(-1,1))[0].drop_duplicates().tolist()
Operation order:
reshape changes the source array into a single column.
Then a DataFrame is created, with default column name = 0.
Then [0] takes just this (the only) column.
drop_duplicates does exactly what the name says.
And the last step: tolist converts to a plain list.
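As another sketch, pd.unique also keeps the order of first appearance and skips the intermediate DataFrame; it assumes the same column names as in the question:
import pandas as pd

# flatten the two columns row by row, then deduplicate while preserving order of appearance
unique_vals = pd.unique(df[['column1', 'column2']].values.ravel()).tolist()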

Pandas - iterate through dataframe and join if any element of a list matches any element of another list

I have two series, and every cell of both series contains a list of elements of varying length. My goal is to perform a cross join between these two series, but only join rows where at least one element of a cell's list in one series matches an element of a cell's list in the other series.
For example:
series_a
0 [1geor, georg, eorge, orges, rgesq, gesqu, esq...
1 [1mark, marks, arksq, rksqu, ksqua, squar, qua...
2 [1prim, primr, rimro, imros, mrose, roses, ose...
3 [1shan, shank, hanka, ankar, nkars, karst, wew...
4 [1stka, stkat, tkath, katha, athar, thari, har...
series_b
0 [115br, 15bro, 5broa, broad, roadw, oadwa, adway]
1 [11par, 1park, parkp, arkpl, rkpla, kplac, place]
2 [125we, 25wes, 5west, west2, est25, st25t, t25th]
3 [135ma, 35mad, 5madi, madis, adiso, dison]
4 [135we, 35wes, 5west, west4, est41]
I want to check, for every row in series_a, whether at least one element in that row equals an element in a row of series_b, and if so, join those rows together in a new dataframe.
So, looking at series_a's first row, check whether '1geor' exists in the 1st, 2nd, 3rd, etc. list of series_b; if TRUE, perform the join, and if FALSE, do not perform the join.
To clarify, the returned dataframe should have two columns, where the first column contains cells from series_a and the second column contains cells from series_b. For all rows in this dataframe, the list in the 1st column should have at least one element that can be found in the list of the 2nd column. For example:
returned_df
0 [115br, 15bro, 5broa] | [15bro, abcde, 12345, hello, world, test1]
1 [11par, 1park, parkp, arkpl, rkpla] | [parkp, broad]
2 [125we, 25wes, 5west, west2, est25, st25t, t25th] | [t25th, sadlf, 234lgk]
...
If an element in a row in series_a occurs in more than one row in series_b, all combinations of matching rows should appear in the final dataframe.
What is the most efficient Python code for this exercise? The code:
any(elem in b for elem in a)
easily answers this for two specific lists, but I want to iterate through both series in their entirety.
Thank you!
I would use a list comprehension, since you have object dtypes in your series and pandas string methods and iterative methods are very slow on object dtypes.
elements = [(item, elem)
            for item in series_a.tolist()
            for elem in series_b.tolist()
            if set(item).intersection(elem)]
df_final = pd.DataFrame(elements)
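A tiny sketch with made-up two-row series (the values are illustrative only) shows the cross join keeping only the overlapping pairs:
import pandas as pd

series_a = pd.Series([['1geor', 'georg'], ['1mark', 'marks']])
series_b = pd.Series([['georg', 'xxxxx'], ['yyyyy', 'zzzzz']])

elements = [(item, elem)
            for item in series_a.tolist()
            for elem in series_b.tolist()
            if set(item).intersection(elem)]
df_final = pd.DataFrame(elements, columns=['series_a', 'series_b'])
# one matching pair survives: (['1geor', 'georg'], ['georg', 'xxxxx'])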
