Specific string slicing - Python

I have a large string array which I store as a NumPy array named np_base:
np.shape(np_base)
Out[32]: (65000000, 1)
What I intend to do is to slice the array vertically in order to decompose it into multiple columns that I'll later store independently, so I tried to loop over the row indexes and append:
for i in range(65000000):
    INCDN.append(np_base[i, 0][0:5])
but this throws a MemoryError.
Could anybody please help me out with this issue? I've been searching for days for an alternative way to slice the string array.
Thanks,

There are many ways to apply a function to a NumPy array, one of which is the following:
np_truncated = np.vectorize(lambda x: x[:5])(np_base)
Your approach of iteratively appending to a list is usually the worst-performing solution in most contexts.
Alternatively, if you intend to work with many columns, you might want to use pandas.
import pandas as pd
df = pd.DataFrame(np_base, columns=["Raw"])
truncated = df.Raw.str.slice(0,5)
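
If memory is the bottleneck, another option worth knowing is a plain dtype cast: converting to a fixed-width unicode dtype truncates every string in one vectorized step. A minimal sketch, assuming the elements are ordinary unicode strings (the width 5 in '<U5' matches the [0:5] slice above):
import numpy as np

np_base = np.array([['abcdefgh'], ['12345678']])  # small stand-in for the real (65000000, 1) array
np_truncated = np_base.astype('<U5')              # keeps only the first 5 characters of each string
print(np_truncated)                               # [['abcde'] ['12345']]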

Related

Pandas - iterating DataFrame rows efficiently and getting values by column name

Iterating in pandas is notoriously inefficient and best avoided whenever possible (using apply for data manipulation, etc.). In my case, unfortunately, it is unavoidable.
While it is widely known that the most efficient way to iterate is itertuples, accessing the column data by the column's str name on the resulting tuple throws the following error:
TypeError: tuple indices must be integers or slices, not str
Some suggest that the solution to this problem is to just switch to iterrows, but as mentioned previously, this is not efficient.
How do I utilize itertuples, while still using the str name of the column to get its row value?
Essentially, one just needs to use the integer index of the required column instead. Since the first value in each tuple is the Index of the originating DataFrame, one can take the column's position in the original DataFrame and add one to account for that Index.
df = pd.DataFrame(some_data)
col_idx = df.columns.get_loc('col name') + 1  # +1 to account for the tuple Index
for row in df.itertuples():
    val = row[col_idx]
    print(val)
This solution may not be the most elegant option, but it works :)
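For reference, a minimal runnable sketch of the same pattern on made-up data (the column name 'col name' contains a space, which is exactly the case where attribute access on the namedtuple fails):
import pandas as pd

df = pd.DataFrame({'col name': [10, 20, 30]})
col_idx = df.columns.get_loc('col name') + 1  # +1 for the leading Index field
for row in df.itertuples():
    print(row[col_idx])  # prints 10, then 20, then 30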

Assign images to the elements of a pandas dataframe in Python

I have a pandas DataFrame in which one of the columns holds images (single-channel uint8 2D images in NumPy array format).
I am iterating through the rows with iterrows(), processing the images, and I want to assign the results (another image, in the same format) to the elements of another column of the DataFrame. I have a column for the processed images.
for index, row in df.iterrows():
    image = row['image']
    processed = process_image(image)
    df.loc[index, 'processed_image'] = processed
However, when I try to use either .loc or .at (or .iloc, .iat), face an error like this (respective for .loc and .at):
ValueError: cannot set using a multi-index selection indexer with a different length than the value
ValueError: setting an array element with a sequence.
Probably .loc and .at expect a single value; when given an array they assume it is meant to fill several cells of the DataFrame. But I don't want that, I want the array stored as a single element.
I couldn't find this exact question elsewhere on the internet. The closest I found was initializing the DataFrame with array elements by hand, not assigning them inside an iterrows loop.
Does anyone know how to solve this? Thanks in advance.
Try adding a new column as a function of the existing columns via the .apply() method, e.g.
df['new_col'] = df.apply(lambda row: ..., axis=1)
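Applied to this case, a minimal sketch: calling .apply() on the image column keeps each returned array as a single object element, so no cell-by-cell .loc assignment is needed (process_image below is a stand-in for the real processing function):
import numpy as np
import pandas as pd

def process_image(img):
    # stand-in for the real processing step; here it simply inverts the image
    return 255 - img

df = pd.DataFrame({'image': [np.zeros((2, 2), dtype=np.uint8) for _ in range(3)]})
df['processed_image'] = df['image'].apply(process_image)
print(df['processed_image'].iloc[0])  # a 2x2 uint8 array stored as a single element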

Why does pandas df.values convert tuples into strings

I have a CSV with over 4000 rows, in which each cell contains a tuple holding a specific coordinate. I would like to convert it to a NumPy array to work with. I use pandas to convert it into a DataFrame before calling df.values. However, after calling df.values, each tuple becomes a string "(x,y)" instead. Is it possible to prevent this from happening? Thank you.
df = pd.read_csv(sample_data)
array = df.values
I think the problem is that a CSV file always stores tuples as strings.
So you need to convert them back:
import ast
df['col'] = df['col'].apply(ast.literal_eval)
Or if all columns are tuples:
df = df.applymap(ast.literal_eval)
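A related option, assuming the tuples live in a column named 'col': pandas can parse them at read time via the converters argument of read_csv, so the string-to-tuple conversion happens during loading:
import ast
import pandas as pd

df = pd.read_csv(sample_data, converters={'col': ast.literal_eval})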
It seems that you read the file from a local path?
My answer is to use eval to convert the strings:
df.apply(lambda x:x.apply(eval))
Another way to change the data type after reading the CSV:
df['col'].apply(tuple)

From pandas array without duplicates to another data structure?

I have a pandas DataFrame and one of its columns has ~10k values.
I want to get an array without duplicates that also supports lookup by index and is sorted!
import pandas as pd
df = pd.read_csv('path', sep=';')
arr = []
for i in df[0].values:
    if i not in arr:
        arr.append(i)
This is actually very time- and memory-consuming: it iterates through the 10k-element array, checks for each element whether it is already stored in the newly created list, and appends it if not.
I know a set has properties such as not allowing duplicates, but I cannot easily look up an element by index, and it cannot be sorted.
Maybe there is another possible solution to this?
You can use pandas.DataFrame.drop_duplicates; see the drop_duplicates() documentation for more information.
You are looking for np.unique, which returns the unique values in sorted order:
np.unique(df[0])
Or, adapted in pandas as .unique(), which keeps the order of first appearance instead:
df[0].unique()
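A minimal sketch of the difference on made-up data:
import numpy as np
import pandas as pd

s = pd.Series([3, 1, 2, 1, 3])
print(np.unique(s))  # [1 2 3] -- sorted, which matches the 'sorted' requirement
print(s.unique())    # [3 1 2] -- order of first appearance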

Subsetting Pandas dataframe via column number

When I want to retrieve the (j+1)-th value from a column of a pandas DataFrame, I can write: df["column_name"].ix[j]
When I check the type of the above code, I get:
type(df["column_name"].ix[j]) #str
I want to write less lengthy code by subsetting via the column index instead. So I write:
df[[i]].ix[j]
However, when I check the type, I get: pandas.core.series.Series
How do I rewrite this so that the indexical subsetting produces a str?
The double subscripting does something other than what you seem to imply it does - it returns a DataFrame of the corresponding columns.
As far as I know, the shortest way to do what you're asking using column-row ordering is
df.iloc[:, i].ix[j]
(There's the shorter
df.icol(i).ix[j]
but it's deprecated.)
One way to do this is like so:
df.ix[j][i]
This is kind of funky though, because the first index is the row and the second is the column, which is rather un-pandas-like - more like matrix indexing than pandas indexing.
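For reference, a minimal sketch of the same lookup with the non-deprecated .iloc accessor, which takes row and column positions directly and returns a scalar (the toy DataFrame below is made up; j is the row index and i the column index as above):
import pandas as pd

df = pd.DataFrame({'a': ['x', 'y'], 'b': ['u', 'v']})
j, i = 1, 0
val = df.iloc[j, i]   # row j, column i -> 'y'
print(type(val))      # <class 'str'>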
