Searching values in dataframe using re.search - python

I have a lot of datasets that I need to iterate through, searching each one for a specific value and returning some values based on the search outcome.
The datasets are stored in a dictionary:
key  type       size     Value
df1  DataFrame  (89,10)  Column names:
df2  DataFrame  (89,10)  Column names:
df3  DataFrame  (89,10)  Column names:
Each dataset looks something like this, and I am trying to check whether the value in the first row of column A contains 035 and, if so, return column B.
A          | B   | C
02 la 035  | nan | nan
Target     | 7   | 5
Warning    | 3   | 6
If I try to search for specific value in it I get an error
TypeError: first argument must be string or compiled pattern
I have tried:
something = []
for key in df:
    text = df[key]
    if re.search('035', text):
        something.append(text['B'])
Something = pd.concat([something], axis=1)

You can use .str.contains(): https://pandas.pydata.org/docs/reference/api/pandas.Series.str.contains.html
df = pd.DataFrame({
"A":["02 la 035", "Target", "Warning"],
"B":[0,7,3],
"C":[0, 5, 6]
})
df[df["A"].str.contains("035")] # Returns first row only.
Also works for regex.
df[df["A"].str.contains("[TW]ar")] # Returns last two rows.
EDIT to answer additional details.
The dataframe I set up looks like this:
           A  B  C
0  02 la 035  0  0
1     Target  7  5
2    Warning  3  6
To extract column B for those rows which match the last regex pattern I used, amend the last line of code to:
df[df["A"].str.contains("[TW]ar")]["B"]
This returns a series. Output:
1    7
2    3
Name: B, dtype: int64
Edit 2: I see you want a dataframe at the end. Just use:
df[df["A"].str.contains("[TW]ar")]["B"].to_frame()
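To tie this back to the dictionary of dataframes in the question, here is a minimal sketch. The dictionary name frames and its contents are made up for illustration, and it assumes the value to test is the first row of column A. The original error came from passing a whole DataFrame to re.search, which only accepts strings, so the sketch converts the single cell to a string first.

```python
import re

import pandas as pd

# Hypothetical stand-in for the asker's dictionary of DataFrames.
frames = {
    "df1": pd.DataFrame({"A": ["02 la 035", "Target", "Warning"], "B": [0, 7, 3], "C": [0, 5, 6]}),
    "df2": pd.DataFrame({"A": ["01 la 999", "Target", "Warning"], "B": [1, 8, 4], "C": [1, 6, 7]}),
}

selected = {}
for key, frame in frames.items():
    # re.search needs a string, so test one cell rather than the whole frame.
    first_value = str(frame["A"].iloc[0])
    if re.search("035", first_value):
        selected[key] = frame["B"]

# One column per matching dataset, labelled with its dictionary key.
result = pd.concat(selected, axis=1) if selected else pd.DataFrame()
print(result)
```

Collecting the matches in a dictionary keeps track of which dataset each B column came from; pd.concat then uses the keys as column labels.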

Related

Join 2 columns of a dataframe based on syntax of values in the 2 columns

I have a Python dataframe and I am trying to combine the cells in the first 2 columns IF the first column value is a string with letters, and the second column value has the syntax of parentheses-single digit-parentheses.
eg: this is the current layout
0     1    2
text  (5)  moretext
this is what I want the result to be:
0         1
text (5)  moretext
I tried using the str.join() function but it's not working for me.
df1 = df.iloc[:, 0:1].str.join(r'(\(\d\))')
please let me know how I can write this, thank you
I believe join is supposed to join lists (stored inside one column) into a string, not several columns into a single column (https://pandas.pydata.org/docs/reference/api/pandas.Series.str.join.html)
I might not have understood your problem completely, but maybe this could work:
# Find the indices of the rows that match your criteria (raw strings keep the regex escapes intact)
idx = df[(df[0].str.contains(r'\w') & df[1].str.contains(r'\(\d\)'))].index.values
df1 = pd.DataFrame()
df1[0] = df[0][idx] + ' ' + df[1][idx]  # merge the two columns for the matching rows
df1[1] = df[2][idx]  # carry the third column over
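A self-contained sketch of the same idea with a boolean mask is shown here. The sample rows are made up, and two choices are assumptions rather than part of the answer above: the letter test uses [A-Za-z] (since \w also matches digits), and the second pattern is anchored so that only a single digit in parentheses qualifies. Like the answer, it keeps only the rows that match.

```python
import pandas as pd

# Hypothetical sample data; only the first row satisfies both conditions.
df = pd.DataFrame([
    ["text", "(5)", "moretext"],
    ["1234", "(7)", "other"],
    ["abc", "(12)", "last"],
])

has_letters = df[0].str.contains(r"[A-Za-z]")      # column 0 contains a letter
is_paren_digit = df[1].str.contains(r"^\(\d\)$")   # column 1 is exactly "(d)"
mask = has_letters & is_paren_digit

# Join columns 0 and 1 for the matching rows and carry column 2 over.
combined = pd.DataFrame({
    0: df.loc[mask, 0] + " " + df.loc[mask, 1],
    1: df.loc[mask, 2],
})
print(combined)
```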

How to append the second column data below the value of first column data?

I have a dataframe as follows:
df
Name Sequence
abc ghijklmkhf
bhf uyhbnfkkkkkk
dmf hjjfkkd
I want to append the second column's data below the first column's data, in a specific format, as follows:
Name  Sequence      Merged
abc   ghijklmkhf    >abc
                    ghijklmkhf
bhf   uyhbnfkkkkkk  >bhf
                    uyhbnfkkkkkk
dmf   hjjfkkd       >dmf
                    hjjfkkd
I tried df['Name'] = '>' + df['Name'].astype(str) to get the name in the format with > symbol. How do I append the second column data below the value of first column data?
You can use vectorized string concatenation:
df['Merged'] = '>'+df['Name']+'\n'+df['Sequence']
output:
Name Sequence Merged
0 abc ghijklmkhf >abc\nghijklmkhf
1 bhf uyhbnfkkkkkk >bhf\nuyhbnfkkkkkk
2 dmf hjjfkkd >dmf\nhjjfkkd
Checking that there are two lines:
print(df.loc[0, 'Merged'])
>abc
ghijklmkhf
As a complement to mozway's solution, if you want to see the dataframe displayed exactly in the format you mentioned, use the following:
from IPython.display import display, HTML
df["Merged"] = ">"+df["Name"]+"\n"+df["Sequence"]
def pretty_print(df):
    return display(HTML(df.to_html().replace("\\n","<br>")))
pretty_print(df)

How to obtain the content of a pandas multilevel index entry?

I set up a pandas dataframe that, besides my data, stores the respective units with it using a MultiIndex like this:
Name        Relative_Pressure Volume_STP
Unit                        -      ccm/g
Description              p/p0
0                    0.042691    29.3601
1                    0.078319    30.3071
2                    0.129529    31.1643
3                    0.183355    31.8513
4                    0.233435    32.3972
5                    0.280847    32.8724
Now I can for example extract only the Volume_STP data by
Unit           ccm/g
Description
0            29.3601
1            30.3071
2            31.1643
3            31.8513
4            32.3972
5            32.8724
With .values I can obtain a numpy array of the data. However how can I get the stored unit? I can't figure out what I need to do to receive the stored ccm/g string.
EDIT: Added example how data frame is generated
Let's say I have a string that looks like this:
Relative Volume # STP
Pressure
cc/g
4.26910e-02 29.3601
7.83190e-02 30.3071
1.29529e-01 31.1643
1.83355e-01 31.8513
2.33435e-01 32.3972
2.80847e-01 32.8724
3.34769e-01 33.4049
3.79123e-01 33.8401
I then use this function:
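import pandas as pd
from io import StringIO

def read_result(contents, columns, units, descr):
    # Skip the four header lines and split on whitespace (sep=r'\s+' replaces the deprecated delim_whitespace=True)
    df = pd.read_csv(StringIO(contents), skiprows=4, sep=r'\s+', index_col=False, header=None)
    df.drop(df.index[-1], inplace=True)  # drop the last row
    # Build the three-level column index: name, unit, description
    index = pd.MultiIndex.from_arrays((columns, units, descr))
    df.columns = index
    df.columns.names = ['Name','Unit','Description']
    df = df.apply(pd.to_numeric)
    return df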
like this
def isotherm(contents):
    columns = ['Relative_Pressure','Volume_STP']
    units = ['-','ccm/g']
    descr = ['p/p0','']
    df = read_result(contents, columns, units, descr)
    return df
to generate the DataFrame at the beginning of my question.
As df has a MultiIndex as columns, df.Volume_STP is still a pandas DataFrame. So you can still access its columns attribute, and the relevant item will be at index 0 because the dataframe contains only 1 Series.
So, you can extract the names that way:
print(df.Volume_STP.columns[0])
which should give: ('ccm/g', '')
Finally, you can extract the unit with .columns[0][0] and the description with .columns[0][1].
You can do something like this:
df.xs('Volume_STP', axis=1).columns.remove_unused_levels().get_level_values(0).tolist()[0]
Output:
'ccm/g'
Slice the dataframe at 'Volume_STP' using xs, then select the columns and remove the unused parts of the column headers, then get the values of the topmost level of that slice, which is the Unit. Convert to a list and select the first value.
A generic way of accessing values on multi-index/columns is by using the index.get_level_values or columns.get_level_values functions of a data frame.
In your example, try df.columns.get_level_values(1) to access the second level of the multi-level column "Unit". If you have already selected a column, say "Volume_STP", then you have removed the top level and in this case, your units would be in the 0th level.
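As a small self-contained illustration of get_level_values, here is a sketch that rebuilds a two-row version of the frame from the question. The variable names units_by_name and volume_unit are made up for this example, and levels are addressed by name rather than by position, which is an optional stylistic choice.

```python
import pandas as pd

# Rebuild a small version of the question's frame with three column levels.
columns = pd.MultiIndex.from_arrays(
    (["Relative_Pressure", "Volume_STP"], ["-", "ccm/g"], ["p/p0", ""]),
    names=["Name", "Unit", "Description"],
)
df = pd.DataFrame([[0.042691, 29.3601], [0.078319, 30.3071]], columns=columns)

# Map every column name to its unit using the full column index.
units_by_name = dict(
    zip(df.columns.get_level_values("Name"), df.columns.get_level_values("Unit"))
)

# After selecting one name, "Unit" is the first remaining level.
volume_unit = df["Volume_STP"].columns.get_level_values("Unit")[0]
print(units_by_name, volume_unit)
```

Addressing the level by name gives the same result before and after selecting a column, so there is no need to track whether the unit sits at position 1 or 0.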

Iterating over multiIndex dataframe

I have a data frame as shown below
I have a problem iterating over the rows. For every row fetched, I want to return the key value. For example, in the second row, for the 2016-08-31 00:00:01 entry, df1 and df3 both have the compass value 4.0, so I want to return the keys that share the same compass value, which are df1 and df3 in this case.
I have been iterating over the rows using
for index,row in df.iterrows():
Update
Okay, now that I understand your question better, this should work for you.
First change the shape of your dataframe with
dfs = df.stack().swaplevel(axis=0)
This will make your dataframe look like:
Then you can iterate the rows like before and extract the information you want. I'm just using print statements for everything, but you can put this in some more appropriate data structure.
for index, row in dfs.iterrows():
    dup_filter = row.duplicated(keep=False)  # mark every value that occurs more than once
    dfss = row[dup_filter].index.values  # keys (df1, df2, ...) that share a value
    print("Attribute:", index[0])
    print("Index:", index[1])
    print("Matches:", dfss, "\n")
which will print out something like
.....
Attribute: compass
Index: 5
Matches: ['df1' 'df3']
Attribute: gyro
Index: 5
Matches: ['df1' 'df3']
Attribute: accel
Index: 6
Matches: ['df1' 'df3']
....
You could also do it one attribute at a time by
dfs_compass = df.stack().swaplevel(axis=0).loc['compass']
and iterate through the rows with just the index.
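Since the original frame is not reproduced here, the following is a hedged end-to-end sketch on made-up data. It assumes the columns are a two-level MultiIndex of (key, attribute), built with pd.concat from a hypothetical dictionary called frames, and it collects the matches in a dictionary instead of printing them.

```python
import pandas as pd

# Hypothetical data: three keys, each with the same two attributes.
frames = {
    "df1": pd.DataFrame({"compass": [1.0, 4.0], "gyro": [0.5, 0.7]}),
    "df2": pd.DataFrame({"compass": [2.0, 5.0], "gyro": [0.6, 0.8]}),
    "df3": pd.DataFrame({"compass": [3.0, 4.0], "gyro": [0.9, 0.7]}),
}
df = pd.concat(frames, axis=1)  # columns become (key, attribute)

# Move the attribute level into the row index: rows are (attribute, index).
dfs = df.stack().swaplevel(axis=0)

matches = {}
for (attribute, idx), row in dfs.iterrows():
    dup_filter = row.duplicated(keep=False)  # True for values shared by 2+ keys
    matches[(attribute, idx)] = list(row[dup_filter].index)

print(matches)
```

In this sample, df1 and df3 share the compass and gyro values in the second row, so both entries for index 1 list those two keys, while the entries for index 0 are empty.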
Old
If I understand your question correctly, you want to return the indexes of rows which have matching values on the second level of your columns, i.e. ('compass', 'accel', 'gyro'). If so, the following will work.
compass_match_indexes = []
for index, row in df.iterrows():
    match_filter = row[:, 'compass'].duplicated()
    if len(row[:, 'compass'][match_filter]) > 0:
        compass_match_indexes.append(index)
You can then select rows from your dataframe with that list, like df.loc[compass_match_indexes]
--
As another approach, you could take the transpose of your DataFrame with df.T and then use the duplicated function.

Data selection using pandas

I have a file where the separator(delimiter) is ';' . I read that file into a pandas dataframe df. Now, I want to select some rows from df using a criteria from column c in df. The format of data in column c is as follows:
[0]science|time|boot
[1]history|abc|red
and so on...
I have another list of words L, which has values such as
[history, geography,....]
Now, if I split the text in column c on '|', then I want to select those rows from df, where the first word does not belong to L.
Therefore, in this example, I will select df[0] but will not choose df[1], since history is present in L and science is not.
I know I can write a for loop and iterate over each object in the dataframe, but I was wondering if I could do this in a more compact and efficient way.
For example, we can do:
df.loc[df['column_name'].isin(some_values)]
I have this:
df = pd.read_csv(path, sep=';', header=None, error_bad_lines=False, warn_bad_lines=False)
dat=df.ix[:,c].str.split('|')
But, I do not know how to index 'dat'. 'dat' is a Pandas Series, as follows:
0 [science, time, boot]
1 [history, abc, red]
....
I tried indexing dat as follows:
dat.iloc[:][0]
But, it gives the entire series instead of just the first element.
Any help would be appreciated.
Thank You in advance
Here is an approach:
Data
df = pd.DataFrame({'c':['history|science','science|chemistry','geography|science','biology|IT'],'col2':range(4)})
Out[433]:
c col2
0 history|science 0
1 science|chemistry 1
2 geography|science 2
3 biology|IT 3
lst = ['geography', 'biology','IT']
Resolution
You can use list comprehension:
df.loc[pd.Series([x.split('|')[0] not in lst for x in df.c.tolist()])]
Out[444]:
c col2
0 history|science 0
1 science|chemistry 1
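A vectorized variant is sketched below as an alternative to the list comprehension, reusing the sample data from this answer. It relies on the .str accessor, where .str[0] takes the first element of each split list (which is also how the Series dat from the question can be indexed element-wise), and on isin for the membership test. The names first_word and selected are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "c": ["history|science", "science|chemistry", "geography|science", "biology|IT"],
    "col2": range(4),
})
lst = ["geography", "biology", "IT"]

# First word of every row: split on '|' and take element 0 of each list.
first_word = df["c"].str.split("|").str[0]

# Keep the rows whose first word is NOT in lst.
selected = df.loc[~first_word.isin(lst)]
print(selected)
```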
