I have a pandas dataframe grouped by certain columns. Now I want to insert the mean of the numeric values of four adjacent columns into a new column. This is what I did:
import re
import pandas as pd

df = pd.read_csv(filename)
# extract a unique ID from the filename
id = re.search(r'(\w\w\w)', filename).group(1)
Files look like this:
col1 | col2 | col3
-----------------------
str1a | str1b | float1
My idea was then the following:
# get the numeric values
df2 = pd.DataFrame(df.groupby(['col1', 'col2']).mean()['col3']).T
# insert the id into a new column
df2.insert(0, 'ID', id)
Now loop over all rows:
for j in range(len(df2.values)):
    for k in df['col1'].unique():
        df2.insert(j+5, (k, 'mean'), df2.values[j])
df2.to_excel('text.xlsx')
But I get the following error, referring to the line with df2.insert:
TypeError: not all arguments converted during string formatting
and
if not allow_duplicates and item in self.items:
# Should this be a different kind of error??
raise ValueError('cannot insert %s, already exists' % item)
I am not sure what string formatting refers to here, since I am only passing numerical values around.
The final output should have all values from col3 in a single row (indexed by id) and every fifth column should be the inserted mean value of the four preceding values.
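For illustration, here is a minimal sketch (not the original code) of one way to build that single-row layout, assuming the col3 values are already in the desired order and come in blocks of four; filename is the variable used above, and the val_/mean_ column names are just placeholders:
import re
import pandas as pd

df = pd.read_csv(filename)
file_id = re.search(r'(\w\w\w)', filename).group(1)

values = df['col3'].to_numpy()
row = {}
for start in range(0, len(values), 4):
    block = values[start:start + 4]
    for offset, v in enumerate(block):
        row[f'val_{start + offset}'] = v
    # every fifth column holds the mean of the four preceding values
    row[f'mean_{start // 4}'] = block.mean()

out = pd.DataFrame(row, index=[file_id])
out.index.name = 'ID'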
If I had to work with files like yours, I would write a function to convert them to CSV, something like this:
import pandas

data = []
with open(filename) as file:
    for lineInFile in file.read().splitlines():
        lineInFile_split = lineInFile.split('|')
        if len(lineInFile_split) > 1:  # keep only data rows, not the '-----' separator
            data.append([field.strip() for field in lineInFile_split])
df = pandas.DataFrame(data, columns=['col1', 'col2', 'col3'])
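Alternatively, a minimal sketch assuming the file really looks like the sample above (a header row followed by the '-----' separator line): pandas can parse the pipe-delimited file directly.
import pandas as pd

# skiprows=[1] drops the '-----' separator line; skipinitialspace trims the
# padding after each '|' delimiter
df = pd.read_csv(filename, sep='|', skiprows=[1], skipinitialspace=True)
df.columns = df.columns.str.strip()   # header cells also carry padding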
Hope it helps!
I have a lot of datasets that I need to iterate through, search for a specific value, and return some values based on the search outcome.
The datasets are stored in a dictionary:
key type size Value
df1 DataFrame (89,10) Column names:
df2 DataFrame (89,10) Column names:
df3 DataFrame (89,10) Column names:
Each dataset looks something like this, and I am trying to check whether the value in column A, row 1 contains 035 and, if so, return column B.
A          B    C
02 la 035  nan  nan
Target     7    5
Warning    3    6
If I try to search for a specific value in it, I get an error:
TypeError: first argument must be string or compiled pattern
I have tried:
something = []
for key in df:
    text = df[key]
    if re.search('035', text):
        something.append(text['B'])
Something = pd.concat([something], axis=1)
You can use .str.contains() : https://pandas.pydata.org/docs/reference/api/pandas.Series.str.contains.html
df = pd.DataFrame({
"A":["02 la 035", "Target", "Warning"],
"B":[0,7,3],
"C":[0, 5, 6]
})
df[df["A"].str.contains("035")] # Returns first row only.
Also works for regex.
df[df["A"].str.contains("[TW]ar")] # Returns last two rows.
EDIT to answer additional details.
The dataframe I set up is the one constructed above.
To extract column B for those rows which match the last regex pattern I used, amend the last line of code to:
df[df["A"].str.contains("[TW]ar")]["B"]
This returns a Series. Output:
1    7
2    3
Name: B, dtype: int64
Edit 2: I see you want a dataframe at the end. Just use:
df[df["A"].str.contains("[TW]ar")]["B"].to_frame()
I'd like to split a column of a dataframe into two separate columns. Here is what my dataframe looks like (only the first 3 rows):
I'd like to split the column referenced_tweets into two columns: type and id in a way that for example, for the first row, the value of the type column would be replied_to and the value of id would be 1253050942716551168.
Here is what I've tried:
df[['type', 'id']] = df['referenced_tweets'].str.split(',', n=1, expand=True)
but I get the error:
ValueError: Columns must be the same length as key
(I think I get this error because the type in the referenced_tweets column is NOT always replied_to; e.g., it can be retweeted, and therefore the lengths would be different.)
Why not get the values from the dict and add them as two new columns?
def unpack_column(df_series, key):
    """Unpack the given key from the first dict in each cell, skipping NaN/empty cells."""
    # the isinstance check avoids calling pd.isna on a list, which returns an array
    return [value[0][key] if isinstance(value, list) and value else None
            for value in df_series]

df['type'] = unpack_column(df['referenced_tweets'], 'type')
df['id'] = unpack_column(df['referenced_tweets'], 'id')
or in a one-liner (this assumes every cell is a non-empty list; NaN rows would raise):
df['type'], df['id'] = zip(*df['referenced_tweets'].apply(lambda x: (x[0]['type'], x[0]['id'])))
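Another option, as a minimal sketch (not part of the answer above), assuming each cell of referenced_tweets is either NaN or a list of dicts with 'type' and 'id' keys, is to let pd.json_normalize do the unpacking:
import pandas as pd

# take the first referenced tweet per row; fall back to empty keys for NaN cells
first = df['referenced_tweets'].apply(
    lambda x: x[0] if isinstance(x, list) and x else {'type': None, 'id': None}
)
expanded = pd.json_normalize(first.tolist())   # one column per dict key
df['type'] = expanded['type'].to_numpy()
df['id'] = expanded['id'].to_numpy()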
I run a for loop to execute a few SQL queries. I capture the results in a DataFrame (again inside the loop), as shown below for two validations.
DataFrame for Test1:

index  column1  column2
0      jack     100
1      bill     200
2      Tom      300

DataFrame for Test2:

index  column1
0      102345
1      102345
I have to write the results of the DataFrame for each test to another table in Oracle. In order to do this, I need the column names. I am unable to identify how many columns are present at a given point in the loop, as the DataFrame can have 1-5 columns depending on the SQL that was run. Is there a way to do this?
Code for reading from table and writing to DataFrame:
def get_src_query_metadata(cursor, sql_query):
    cursor.execute(sql_query)
    columns = [col[0] for col in cursor.description]
    cursor.rowfactory = lambda *args: dict(zip(columns, args))
    data = pd.DataFrame(cursor.fetchall())
    return data

def get_target_query_metadata(cursor, sql_query):
    cursor.execute(sql_query)
    columns = [col[0] for col in cursor.description]
    cursor.rowfactory = lambda *args: dict(zip(columns, args))
    data = pd.DataFrame(cursor.fetchall())
    return data

def main():
    _JobDict_src = get_src_query_metadata(cursor, src_query[i])
    _JobDict_tgt = get_target_query_metadata(cursor, target_query[i])
How do I get the column names and their values assigned to separate variables?
You can find and count the column names with this loop:
coln = 0
for col in df.columns:
    coln += 1
    print(col)
print(coln)
and find the data types with the following:
for col in df.dtypes:
    print(col)
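If the end goal is writing each result set to Oracle without knowing the column count in advance, here is a minimal sketch (the target table name is a placeholder, and cursor is assumed to be the same cx_Oracle cursor as in your code):
cols = list(df.columns)                       # e.g. ['column1', 'column2']
placeholders = ", ".join(f":{i + 1}" for i in range(len(cols)))
insert_sql = f"INSERT INTO target_table ({', '.join(cols)}) VALUES ({placeholders})"
rows = list(df.itertuples(index=False, name=None))   # list of plain value tuples
# cursor.executemany(insert_sql, rows)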
I have a dataframe as follows:
Items Data
enst.35 abc|frame|gtk|enst.24|pc|hg|,abc|framex|gtk4|enst.35|pxc|h5g|,abc|frbx|hgk4|enst.23|pix|hoxg|,abc|framex|gtk4|enst.35|pxc|h5g|
enst.18 abc|frame|gtk|enst.15|pc|hg|,abc|framex|gtk2|enst.59|pxc|h5g|,abc|frbx|hgk4|enst.18|pif|holg|,abc|framex|gtk4|enst.35|pxc|h5g|
enst.98 abc|frame|gtk|enst.98|pc|hg|,abc|framex|gtk1|enst.45|pxc|h5g|,abc|frbx|hgk4|enst.74|pig|ho6g|,abc|framex|gtk4|enst.35|pxc|h5g|
enst.63 abc|frame|gtk|enst.34|pc|hg|,abc|framex|gtk1|enst.67|pxc|h5g|,abc|frbx|hgk4|enst.39|pik|horg|,abc|framex|
I want to extract from Data the entry that contains the Items value, i.e. only the chunk between the comma separators. I want to match the row 1 value of column 1 to row 1 of column 2, row 2 of column 1 to row 2 of column 2, and so on.
If no match is found, fill the output with 'NA'. There can be multiple occurrences of the id in the same cell, but I want to consider only the first occurrence.
The expected output is:
abc|framex|gtk4|enst.35|pxc|h5g|
abc|frbx|hgk4|enst.18|pif|holg|
abc|frame|gtk|enst.98|pc|hg|
NA
I tried the following code to generate the output:
import pandas as pd

df = pd.read_table('file1.txt', sep="\t")
keywords = df['Items'].to_list()
df_map = df.Data[df.Data.str.contains('|'.join(keywords))].reindex(df.index)
But the output generated has all the terms containing the keywords:
Data
abc|frame|gtk|enst.24|pc|hg|,abc|framex|gtk4|enst.35|pxc|h5g|,abc|frbx|hgk4|enst.23|pix|hoxg|abc|framex|gtk4|enst.35|pxc|h5g|
abc|frame|gtk|enst.15|pc|hg|,abc|framex|gtk2|enst.59|pxc|h5g|,abc|frbx|hgk4|enst.18|pif|holg|abc|framex|gtk4|enst.35|pxc|h5g|
abc|frame|gtk|enst.98|pc|hg|,abc|framex|gtk1|enst.45|pxc|h5g|,abc|frbx|hgk4|enst.74|pig|ho6g|abc|framex|gtk4|enst.35|pxc|h5g|
NA
What changes can I make in the code to generate the correct output as expected?
Use DataFrame.apply along axis=1 with a custom function that extracts the string associated with the occurrence of df['Items'] in df['Data']:
import re
import numpy as np

def find(s):
    mobj = re.search(rf"[^,]+{re.escape(s['Items'])}[^,]+", s['Data'])
    if mobj:
        return mobj.group(0)
    return np.nan

df['Data'] = df.apply(find, axis=1)
Or use a faster solution:
pattern = '|'.join([rf'[^,]+{re.escape(k)}[^,]+' for k in df['Items']])
df['Data'] = df['Data'].str.findall(pattern).str.get(0)
# print(df['Data'])
0 abc|framex|gtk4|enst.35|pxc|h5g|
1 abc|frbx|hgk4|enst.18|pif|holg|
2 abc|frame|gtk|enst.98|pc|hg|
3 NaN
Name: Data, dtype: object
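If you need the literal string 'NA' instead of NaN, as in your expected output, you can chain a fillna call onto the second solution:
df['Data'] = df['Data'].str.findall(pattern).str.get(0).fillna('NA')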
We can formally define a key-value pair list as follows:
kvlist = <key>[kvdelim]<value>([pairdelim]<key>[kvdelim]<value>)*
key = <string>|<quoter><string><quoter>
value = <string>|<quoter><string><quoter>
quoter = "
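A minimal sketch of a parser for that grammar, assuming '=' as the kvdelim and whitespace as the pairdelim (neither is specified above), with '"' as the quoter:
import re

# quoted or unquoted key, '=' delimiter, quoted or unquoted value
PAIR_RE = re.compile(r'(?:"([^"]*)"|(\S+?))=(?:"([^"]*)"|(\S+))')

def parse_kvlist(text):
    pairs = {}
    for qkey, key, qval, val in PAIR_RE.findall(text):
        pairs[qkey or key] = qval or val
    return pairs

print(parse_kvlist('a=1 "long key"="long value" b=2'))
# {'a': '1', 'long key': 'long value', 'b': '2'}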
I have a data frame as shown below.
I have a problem iterating over the rows: for every row fetched, I want to return the key value. For example, in the second row, for the 2016-08-31 00:00:01 entry, df1 and df3 have a compass value of 4.0, so I want to return the keys that have the same compass value, which are df1 and df3 in this case.
I have been iterating over the rows using
for index, row in df.iterrows():
Update
Okay, so now that I understand your question better, this will work for you.
First, change the shape of your dataframe with
dfs = df.stack().swaplevel(axis=0)
This will reshape your dataframe so its index is (attribute, row) and its columns are the dataset keys (df1, df2, df3).
Then you can iterate the rows like before and extract the information you want. I'm just using print statements for everything, but you can put this in some more appropriate data structure.
for index, row in dfs.iterrows():
    dup_filter = row.duplicated(keep=False)
    dfss = row[dup_filter].index.values
    print("Attribute:", index[0])
    print("Index:", index[1])
    print("Matches:", dfss, "\n")
which will print out something like
.....
Attribute: compass
Index: 5
Matches: ['df1' 'df3']
Attribute: gyro
Index: 5
Matches: ['df1' 'df3']
Attribute: accel
Index: 6
Matches: ['df1' 'df3']
....
You could also do it one attribute at a time by
dfs_compass = df.stack().swaplevel(axis=0).loc['compass']
and iterate through the rows with just the index.
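For example, a short sketch building on that line:
dfs_compass = df.stack().swaplevel(axis=0).loc['compass']
for idx, row in dfs_compass.iterrows():
    # keys whose compass value appears more than once in this row
    matches = row[row.duplicated(keep=False)].index.values
    print(idx, matches)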
Old
If I understand your question correctly, i.e. you want to return the indexes of rows which have matching values on the second level of your columns ('compass', 'accel', 'gyro'), the following will work.
compass_match_indexes = []
for index, row in df.iterrows():
    match_filter = row[:, 'compass'].duplicated()
    if len(row[:, 'compass'][match_filter]) > 0:
        compass_match_indexes.append(index)
You can then select from your dataframe with that list, like df.loc[compass_match_indexes].
--
Another approach: you could get the transpose of your DataFrame with df.T and then use the duplicated function.
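A sketch of that idea, assuming df has MultiIndex columns of the form (dataset, attribute), e.g. ('df1', 'compass'):
dft = df.T                                   # one row per (dataset, attribute) series
dups = dft[dft.duplicated(keep=False)]       # series that are exact duplicates of another
print(dups.index.tolist())                   # e.g. [('df1', 'compass'), ('df3', 'compass')]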