I would like to create a new column in my pandas DataFrame based on matching strings. I have pathnames of images that contain either the string 'distorted' or 'original'. I would like to assign the string values 'd' and 'o' in the new column respectively. I have been using np.select but I got a shape-mismatch error.
This is my code:
type_cond = [(df[df['img_name'].str.contains(r'\bdistorted\b')]), (df[df['img_name'].str.contains(r'\boriginal\b')])]
type_values = ['d', 'o']
df['image_type'] = np.select(type_cond, type_values)
When I run the conditions separately, I get the expected output:
distorted = df[df['img_name'].str.contains(r'\bdistorted\b')]
output:
        n  r                                            img_name rid
id
...
2995    I  2   images/distorted/png/3MRNMEIQW56USS7S1XTZ20C8J...   E
2996    I  3   images/distorted/png/30MVJZJNHMDCUC6BMWCK0PGQO...   E
2997    I  2   images/distorted/png/3MYYFCXHJ37164AYXVVQM4DUA...   E
2998    I  3   images/distorted/png/39RP059MEHTLJDRTND387N3XG...   E
2999    I  1   images/distorted/png/3EKVH9QMEY4OR6LKRRBUN4DZD...   E

[2003 rows x 4 columns]
When filtering the strings that contain 'original', it selects: [997 rows x 4 columns]
The entire data frame is of size: [3000 rows x 4 columns]
I don't see why there is a shape mismatch, since every row is covered by one condition or the other.
The problem is that the entries in your conditions list are filtered DataFrames, not boolean masks. np.select expects boolean arrays, so remove the boolean indexing (the outer df[...]):
type_cond = [df['img_name'].str.contains(r'\bdistorted\b'),
df['img_name'].str.contains(r'\boriginal\b')]
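A minimal runnable sketch of the fix (the two-row frame and the 'u' default for unmatched rows are illustrative assumptions, not from the original data):

import numpy as np
import pandas as pd

# illustrative stand-in for the real 3000-row frame
df = pd.DataFrame({'img_name': ['images/distorted/png/a.png',
                                'images/original/png/b.png']})

# boolean Series, not filtered DataFrames
type_cond = [df['img_name'].str.contains(r'\bdistorted\b'),
             df['img_name'].str.contains(r'\boriginal\b')]
type_values = ['d', 'o']

# 'u' marks any row that matches neither pattern
df['image_type'] = np.select(type_cond, type_values, default='u')
print(df)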
I have a 1st DataFrame with column 'X' as:
X
A468593-3
A697269-2
A561044-2
A239882 04
and a 2nd DataFrame with column 'Y' as:
Y
000A561044
000A872220
I would like to match substrings from both columns with a minimum number of characters (for example, 7 characters; only alphanumeric characters should be considered for matching, and all special characters excluded).
So, my output DataFrame should be like this:
X
A561044-2
Any possible solution would be highly appreciated.
Thanks in advance.
IIUC, and assuming that every value of Y starts with three zeros, you can slice Y with [3:] to remove the three leading zeros. Then, you can join these values with |. Finally, you can create your mask using contains, which checks whether a Series contains a specified pattern (in your case you would have something like 'A|B' and check whether a value contains 'A' or 'B'). This mask can then be used to filter your other data frame.
Code:
import pandas as pd
df1 = pd.DataFrame({"X": ["A468593-3", "A697269-2", "A561044-2", "A239882 04"]})
df2 = pd.DataFrame({"Y": ["000A561044", "000A872220"]})
mask = df1["X"].str.contains(f'({"|".join(df2["Y"].str[3:])})')
df1.loc[mask]
Output:
X
2 A561044-2
If you have values in Y that do not start with exactly three zeros, you can use this function instead to strip all leading digits from each value.
def remove_first_numerics(s):
    counter = 0
    # guard against strings that consist entirely of digits
    while counter < len(s) and s[counter].isnumeric():
        counter += 1
    return s[counter:]
df_test = pd.DataFrame({"A": ["01Abd3Dc", "3Adv3Dc", "d31oVgZ", "10dZb1B", "CCcDx10"]})
df_test["A"].apply(lambda s: remove_first_numerics(s))
Output:
0 Abd3Dc
1 Adv3Dc
2 d31oVgZ
3 dZb1B
4 CCcDx10
Name: A, dtype: object
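The cleaned column can then feed the same mask-building step as before (a sketch reusing the df1 and df2 frames defined above):

cleaned = df2["Y"].apply(remove_first_numerics)
mask = df1["X"].str.contains("|".join(cleaned))
df1.loc[mask]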
I want to convert columns in my DataFrame from object to int, and I need to completely delete the rows that contain letter strings.
The following expression "saves" the data I care about and converts the column from object to int:
df["column name"] = df["column name"].astype(str).str.replace(r'/\d+$', '', regex=True).astype(int)
However, before this, I want to completely delete the rows that contain letters (A-Z).
I tried:
df[~df["column name"].str.lower().str.startswith('A-Z')]
I also tried a few other expressions; however, no data gets cleaned.
DataFrame looks something like this:
A B C
0 8161 0454 9600
1 - 3780 1773 1450
2 2564 0548 5060
3 1332 9179 2040
4 6010 3263 1050
5 I Forgot 7849 1400/10000
Col C, 1400/10000: the first expression I wrote simply removes "/10000" and keeps "1400".
Now I need to remove the rows with word expressions, as in cell A5 ("I Forgot").
Using a regular expression, you can create a mask for all rows that contain a character in [a-z]; then you can drop those rows, like this:
mask = df['A'].str.lower().str.contains("[a-z]")
idx = df.index[mask]
df = df.drop(idx, axis=0)
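Putting both steps together on the sample frame (a sketch; the literal values are copied from the question, and the final astype(int) assumes every surviving value is numeric):

import pandas as pd

df = pd.DataFrame({
    "A": ["8161", "-3780", "2564", "1332", "6010", "I Forgot"],
    "B": ["0454", "1773", "0548", "9179", "3263", "7849"],
    "C": ["9600", "1450", "5060", "2040", "1050", "1400/10000"],
})

# drop rows where column A contains any letter
df = df[~df["A"].str.lower().str.contains("[a-z]")]

# strip the trailing "/NNNN" part and convert to int
df["C"] = df["C"].str.replace(r"/\d+$", "", regex=True).astype(int)
print(df)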
I need to merge two pandas data frames using a columns which contains numerical values.
For example, the two data frames could be like the following ones:
data frame "a"
a1 b1
0 "x" 13560
1 "y" 193309
2 "z" 38090
3 "k" 37212
data frame "b"
a2 b2
0 "x" 13,56
1 "y" 193309
2 "z" 38,09
3 "k" 37212
What I need to do is merge a with b on columns b1/b2.
The problem is that, as you can see, some values of data frame b are a little bit different. First of all, b's values are not integers but strings, and furthermore the values which originally ended with 0 are "rounded" (13560 --> 13,56).
What I've tried to do is replace the comma and then cast to int, but it doesn't work; more specifically, this procedure doesn't add back the missing trailing zero.
This is the code that I've tried:
b['b2'] = b['b2'].str.replace(",", "")
b['b2'] = b['b2'].astype(np.int64) # np is numpy
Is there any procedure that i can use to fix this problem?
I believe you need to create a boolean mask to specify which values have to be multiplied:
# or add the parameter thousands=',' to read_csv, as suggested by @Inder
b['b2'] = b['b2'].str.replace(",", "", regex=True).astype(np.int64)
mask = b['b2'] < 10000
b['b2'] = np.where(mask, b['b2'] * 10, b['b2'])
print (b)
a2 b2
0 x 13560
1 y 193309
2 z 38090
3 k 37212
Alternatively, correct the column first with an apply and a lambda function:
b.b2 = b.b2.apply(lambda x: int(x.replace(',','')) * 10 if ',' in x else int(x))
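Once b2 holds plain integers again, the merge itself is straightforward (a one-line sketch assuming the frames a and b from the question):

merged = a.merge(b, left_on='b1', right_on='b2')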
I'm confused about the syntax regarding the following line of code:
x_values = dataframe[['Brains']]
The dataframe object consists of 2 columns (Brains and Bodies)
Brains Bodies
42 34
32 23
When I print x_values I get something like this:
Brains
0 42
1 32
I'm aware of the pandas documentation as far as attributes and methods of the dataframe object are concerned, but the double bracket syntax is confusing me.
Consider this:
Source DF:
In [79]: df
Out[79]:
Brains Bodies
0 42 34
1 32 23
Selecting one column - results in Pandas.Series:
In [80]: df['Brains']
Out[80]:
0 42
1 32
Name: Brains, dtype: int64
In [81]: type(df['Brains'])
Out[81]: pandas.core.series.Series
Selecting subset of DataFrame - results in DataFrame:
In [82]: df[['Brains']]
Out[82]:
Brains
0 42
1 32
In [83]: type(df[['Brains']])
Out[83]: pandas.core.frame.DataFrame
Conclusion: the second approach allows us to select multiple columns from the DataFrame, while the first one only selects a single column.
Demo:
In [84]: df = pd.DataFrame(np.random.rand(5,6), columns=list('abcdef'))
In [85]: df
Out[85]:
a b c d e f
0 0.065196 0.257422 0.273534 0.831993 0.487693 0.660252
1 0.641677 0.462979 0.207757 0.597599 0.117029 0.429324
2 0.345314 0.053551 0.634602 0.143417 0.946373 0.770590
3 0.860276 0.223166 0.001615 0.212880 0.907163 0.437295
4 0.670969 0.218909 0.382810 0.275696 0.012626 0.347549
In [86]: df[['e','a','c']]
Out[86]:
e a c
0 0.487693 0.065196 0.273534
1 0.117029 0.641677 0.207757
2 0.946373 0.345314 0.634602
3 0.907163 0.860276 0.001615
4 0.012626 0.670969 0.382810
and if we specify only one column in the list we will get a DataFrame with one column:
In [87]: df[['e']]
Out[87]:
e
0 0.487693
1 0.117029
2 0.946373
3 0.907163
4 0.012626
There is no special syntax in Python for [[ and ]]. Rather, a list is being created, and then that list is being passed as an argument to the DataFrame indexing function.
As per @MaxU's answer, if you pass a single string to a DataFrame, a Series that represents that one column is returned. If you pass a list of strings, then a DataFrame that contains the given columns is returned.
So, when you do the following
# Print "Brains" column as Series
print(df['Brains'])
# Return a DataFrame with only one column called "Brains"
print(df[['Brains']])
It is equivalent to the following
# Print "Brains" column as Series
column_to_get = 'Brains'
print(df[column_to_get])
# Return a DataFrame with only one column called "Brains"
subset_of_columns_to_get = ['Brains']
print(df[subset_of_columns_to_get])
In both cases, the DataFrame is being indexed with the [] operator.
Python uses the [] operator for both indexing and for constructing list literals, and ultimately I believe this is your confusion. The outer [ and ] in df[['Brains']] is performing the indexing, and the inner is creating a list.
>>> some_list = ['Brains']
>>> some_list_of_lists = [['Brains']]
>>> ['Brains'] == [['Brains']][0]
True
>>> 'Brains' == [['Brains']][0][0] == [['Brains'][0]][0]
True
What I am illustrating above is that at no point does Python ever see [[ and interpret it specially. In the last convoluted example ([['Brains'][0]][0]) there is no special ][ operator or ]][ operator... what happens is:
1. A single-element list is created (['Brains'])
2. The first element of that list is indexed (['Brains'][0] => 'Brains')
3. That is placed into another list ([['Brains'][0]] => ['Brains'])
4. And then the first element of that list is indexed ([['Brains'][0]][0] => 'Brains')
Other solutions demonstrate the difference between a Series and a DataFrame. For the mathematically minded, you may wish to consider the dimensions of your input and output. Here's a summary:

Object                     Series              DataFrame
Dimensions (obj.ndim)      1                   2
Syntax arg dim             0                   1
Syntax                     df['col']           df[['col']]
Max indexing dim           1                   2
Label indexing             df['col'].loc[x]    df.loc[x, 'col']
Label indexing (scalar)    df['col'].at[x]     df.at[x, 'col']
Integer indexing           df['col'].iloc[x]   df.iloc[x, 'col']
Integer indexing (scalar)  df['col'].iat[x]    df.iat[x, 'col']
When you specify a scalar or list argument to pd.DataFrame.__getitem__, for which [] is syntactic sugar, the dimension of your argument is one less than the dimension of your result. So a scalar (0-dimensional) gives a 1-dimensional series. A list (1-dimensional) gives a 2-dimensional dataframe. This makes sense since the additional dimension is the dataframe index, i.e. rows. This is the case even if your dataframe happens to have no rows.
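To see the dimension rule concretely (a minimal sketch using the Brains/Bodies frame from the question):

import pandas as pd

df = pd.DataFrame({'Brains': [42, 32], 'Bodies': [34, 23]})

print(df['Brains'].ndim)    # 1: scalar argument (0-dim) gives a Series (1-dim)
print(df[['Brains']].ndim)  # 2: list argument (1-dim) gives a DataFrame (2-dim)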
[ ] and [[ ]] are the same concept in NumPy.
Try to understand the basics of creating an np.array, use reshape, and check the result with ndim, and you'll understand.
Check my answer here:
https://stackoverflow.com/a/70194733/7660981
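For instance (a minimal sketch of that idea):

import numpy as np

a = np.array([1, 2, 3])    # one pair of brackets -> ndim == 1
b = np.array([[1, 2, 3]])  # two pairs of brackets -> ndim == 2

print(a.ndim, a.shape)        # 1 (3,)
print(b.ndim, b.shape)        # 2 (1, 3)
print(a.reshape(1, 3).ndim)   # reshaping a to 2-D matches b's layout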
In some transformations, I seem to be forced to break from the Pandas dataframe grouped object, and I would like a way to return to that object.
Given a dataframe of time-series data, if one groups by one of the values in the dataframe, one effectively gets an underlying dictionary from key to dataframe.
Being forced to make a Python dict from this, the structure cannot be converted back into a DataFrame using .from_dict(), because the structure maps keys to DataFrames.
The only way to go back to Pandas without some hacky column renaming is, to my knowledge, by converting it back to a grouped object.
Is there any way to do this?
If not, how would I convert a dictionary of instance to dataframe back into a Pandas datastructure?
EDIT, ADDING A SAMPLE:
import pandas as pd
from numpy.random import randn

rng = pd.date_range('1/1/2000', periods=10, freq='10M')  # month-end dates
df = pd.DataFrame({'a': pd.Series(randn(len(rng)), index=rng),
                   'b': pd.Series(randn(len(rng)), index=rng)})
# now we have a dataframe with 'a's and 'b's in a time series

df_dict = {}
for k, v in df.groupby('a'):
    df_dict[k] = v
# now we apply some transformation that cannot be applied via aggregate, transform, or apply
# how do we get this back into a grouped-by object?
If I understand OP's question correctly, you want to group a dataframe by some key(s), do different operations on each group (possibly generating new columns, etc.) and then go back to the original dataframe.
Modifying your example (grouping by random integers instead of floats, which are usually unique):
import numpy as np
import pandas as pd

np.random.seed(200)
rng = pd.date_range('1/1/2000', periods=10, freq='10M')
df = pd.DataFrame({'a': pd.Series(np.random.randn(len(rng)), index=rng),
                   'b': pd.Series(np.random.randn(len(rng)), index=rng)})
df['group'] = np.random.randint(3, size=(len(df)))
Usually, if I need a single value per column for each group, I'll do this (for example, sum of 'a', mean of 'b'):
In [10]: df.groupby('group').aggregate({'a':np.sum, 'b':np.mean})
Out[10]:
a b
group
0 -0.214635 -0.319007
1 0.711879 0.213481
2 1.111395 1.042313
[3 rows x 2 columns]
However, if I need a Series for each group:
In [19]: def func(sub_df):
   ....:     sub_df['c'] = sub_df['a'] * sub_df['b'].shift(1)
   ....:     return sub_df
   ....:
In [20]: df.groupby('group').apply(func)
Out[20]:
a b group c
2000-01-31 -1.450948 0.073249 0 NaN
2000-11-30 1.910953 1.303286 2 NaN
2001-09-30 0.711879 0.213481 1 NaN
2002-07-31 -0.247738 1.017349 2 -0.322874
2003-05-31 0.361466 1.911712 2 0.367737
2004-03-31 -0.032950 -0.529672 0 -0.002414
2005-01-31 -0.221347 1.842135 2 -0.423151
2005-11-30 0.477257 -1.057235 0 -0.252789
2006-09-30 -0.691939 -0.862916 2 -1.274646
2007-07-31 0.792006 0.237631 0 -0.837336
[10 rows x 4 columns]
I'm guessing you want something like the second example, but the original question wasn't very clear, even with your sample.
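If you do end up holding a plain dict of per-group frames, pd.concat will stitch them back into a single DataFrame, which you can then regroup (a sketch assuming the df_dict from the question, where the grouping column 'a' is still present in each stored frame):

import pandas as pd

# stack the per-group frames back into one DataFrame, restoring time order
combined = pd.concat(df_dict.values()).sort_index()

# and call groupby again to get a GroupBy object back
grouped = combined.groupby('a')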