I have a pandas dataframe in which one column holds a list of hashtags. Now, I would like to delete all elements in that list except the first one, for each row.
Is there a way of doing this?
A simple way to do so:
df.hashtags = df.hashtags.map(lambda l: l[:1])
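For example, a quick sketch with a made-up hashtags column (the column name is taken from the question; the data is hypothetical):

import pandas as pd

# each cell holds a list of hashtags; [:1] keeps at most the first element
df = pd.DataFrame({'hashtags': [['#a', '#b', '#c'], ['#x'], []]})
df.hashtags = df.hashtags.map(lambda l: l[:1])
print(df.hashtags.tolist())  # [['#a'], ['#x'], []]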
I have a list containing Pandas Series objects, which I've created by doing something like this:
li = []
li.append(input_df.iloc[0])
li.append(input_df.iloc[4])
where input_df is a Pandas DataFrame.
I want to convert this list of Series objects back to a Pandas DataFrame, and was wondering if there is an easy way to do it.
Based on your post, you can do this with:
pd.DataFrame(li)
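As a minimal sketch (with a made-up input_df, since the full one isn't shown):

import pandas as pd

input_df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
li = [input_df.iloc[0], input_df.iloc[2]]

# each Series becomes one row; the Series names become the index labels
print(pd.DataFrame(li))
#    a  b
# 0  1  4
# 2  3  6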
To everyone suggesting pd.concat: this is not a Series anymore. The values are being appended to a list, so the data type of li is a list. To convert that list to a dataframe, use pd.DataFrame(<list name>).
Since the right answer got hidden in the comments, I thought it would be better to mention it as an answer:
pd.concat(li, axis=1).T
will convert the list li of Series to a DataFrame.
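For example, with a made-up input_df:

import pandas as pd

input_df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
li = [input_df.iloc[0], input_df.iloc[2]]

# concat along axis=1 makes each Series a column; .T flips them back to rows
print(pd.concat(li, axis=1).T)
#    a  b
# 0  1  4
# 2  3  6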
It seems that you wish to perform a customized melting of your dataframe.
Using the pandas library, you can do it with one line of code. Below, I create an example to replicate your problem:
import pandas as pd
input_df = pd.DataFrame(data={'1': [1, 2, 3, 4, 5],
                              '2': [1, 2, 3, 4, 5],
                              '3': [1, 2, 3, 4, 5],
                              '4': [1, 2, 3, 4, 5],
                              '5': [1, 2, 3, 4, 5]})
Using pd.DataFrame, you can create a new dataframe that melts your two selected rows together:
li = []
li.append(input_df.iloc[0])
li.append(input_df.iloc[4])
new_df = pd.DataFrame(li)
If what you want is for those two rows to end up under one column, I would not pass them as a list back to the DataFrame constructor.
Instead, you can just append one to the other, disregarding the column names of each:
new_df = input_df.iloc[0].append(input_df.iloc[4])
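Note that Series.append was deprecated in pandas 1.4 and removed in 2.0; on recent versions, a minimal equivalent (using the input_df defined above) is pd.concat:

# same stacked result on pandas >= 2.0, where Series.append no longer exists
new_df = pd.concat([input_df.iloc[0], input_df.iloc[4]])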
Let me know if this answers your question.
The answer was already mentioned, but I would like to share my version:
li_df = pd.DataFrame(li).T
If you want each Series to be a row of the dataframe, you should not use concat() followed by .T, unless all your values have the same datatype.
If your data has both numerical and string values, then the transpose will mangle the dtypes, likely turning them all into objects.
The right way to do this in general is:
Convert each Series to a dict()
Pass the list of dicts either into the pd.DataFrame() constructor directly, or use pd.DataFrame.from_records().
In your case the following should work:
my_list_of_dicts = [s.to_dict() for s in li]
my_df = pd.DataFrame(my_list_of_dicts)
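For illustration, a small sketch of the dtype difference, using a made-up mixed-type frame:

import pandas as pd

input_df = pd.DataFrame({'name': ['a', 'b'], 'value': [1, 2]})
li = [input_df.iloc[0], input_df.iloc[1]]

print(pd.concat(li, axis=1).T.dtypes)                  # name and value both object
print(pd.DataFrame([s.to_dict() for s in li]).dtypes)  # value comes back as int64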
I search a Pandas DataFrame with loc, for example like this:
x = df.loc[df.index.isin(['one','two'])]
But I need only the first row of the result. If I use
x = df.loc[df.index.isin(['one','two'])].iloc[0]
I get an error in the case that no row is found. Of course, I can select all the rows (as in the first example) and then check whether the result is empty or not. But I am looking for a more efficient way (the dataframe can be long). Is there one?
pandas.Index.duplicated
The pandas.Index object has a duplicated method that identifies all repeated values after the first occurrence.
x[~x.index.duplicated()]
If you wanted to combine it with your isin filter:
df[df.index.isin(['one', 'two']) & ~df.index.duplicated()]
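As a quick sketch with a hypothetical frame (labels and values are made up):

import pandas as pd

df = pd.DataFrame({'val': [1, 2, 3, 4]}, index=['one', 'one', 'two', 'three'])

# keeps the first row for each matching label
print(df[df.index.isin(['one', 'two']) & ~df.index.duplicated()])

# no match returns an empty frame instead of raising, unlike .iloc[0]
print(df[df.index.isin(['missing']) & ~df.index.duplicated()])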
I have a large string array which I store as a NumPy array named np_base: np.shape(np_base)
Out[32]: (65000000, 1)
What I intend to do is vertically slice the array in order to decompose it into multiple columns that I'll store later independently, so I tried to loop over the row indexes and append:
for i in range(65000000):
    INCDN.append(np_base[i, 0][0:5])
but this throws a memory error.
Could anybody please help me out with this issue? I've been searching for days for an alternative way to slice the string array.
Thanks,
There are many ways to apply a function to a numpy array, one of which is the following:
np_truncated = np.vectorize(lambda x: x[:5])(np_base)
Your approach of iteratively appending to a list is usually the worst-performing solution in most contexts.
Alternatively, if you intend to work with many columns, you might want to use pandas.
import pandas as pd
df = pd.DataFrame(np_base, columns=["Raw"])
truncated = df.Raw.str.slice(0,5)
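For example, a small sketch with a tiny stand-in for np_base (the real array would hold 65 million strings):

import numpy as np
import pandas as pd

np_base = np.array([['abcdefgh'], ['12345678']])

df = pd.DataFrame(np_base, columns=["Raw"])
print(df.Raw.str.slice(0, 5).tolist())  # ['abcde', '12345']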
So, I have a list with tuples, and a multi-index dataframe. I want to find the rows of the dataframe whose indices are NOT included in the list of tuples, and create a new dataframe with these elements. Any help? Thanks!
You can use isin with a negation to explicitly filter your DataFrame:
new_df = df[~df.index.isin(list_of_tuples)]
Alternatively, use drop to remove the tuples you don't want to be included in the new DataFrame.
new_df = df.drop(list_of_tuples)
From a couple of simple tests, using isin appears to be faster, although drop is a bit more readable.
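As a quick sketch with a made-up MultiIndex frame:

import pandas as pd

idx = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1)])
df = pd.DataFrame({'val': [10, 20, 30]}, index=idx)
list_of_tuples = [('a', 1), ('b', 1)]

print(df[~df.index.isin(list_of_tuples)])  # keeps only ('a', 2)
print(df.drop(list_of_tuples))             # same result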