I have a pandas dataframe with one column and 100 rows. I would like to merge all the values of that column into one single value.
Ex:
S.No Text
0 abc
1 def
2 ghi
3 jkl
4 mno
I want the result to be "abcdefghijklmno" as one single value.
Any ideas?
Two pandas ways:
df.Text.sum()
Out[72]: 'abcdefghijklmno'
''.join(df.Text)
Out[77]: 'abcdefghijklmno'
Using a for loop:
s = ''
for x, y in df.iterrows():
    s += y['Text']
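For reference, a minimal, self-contained sketch of all three approaches; the DataFrame construction below is assumed from the example above, and Series.str.cat would work as well:

import pandas as pd

df = pd.DataFrame({'Text': ['abc', 'def', 'ghi', 'jkl', 'mno']})

print(df.Text.sum())     # 'abcdefghijklmno' - summing strings concatenates them
print(''.join(df.Text))  # 'abcdefghijklmno' - plain Python join over the column

# loop version, equivalent but usually slower on large frames
s = ''
for _, row in df.iterrows():
    s += row['Text']
print(s)                 # 'abcdefghijklmno'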
I'm having trouble matching a regex across two different dataframes that are linked by Type and Country. Here is a sample of the data df and the regex df. Note that the shapes of the two dataframes differ because the regex df contains only unique values.
**Data df**
Country  Type  Data
MY       ABC   MY1234567890
IT       ABC   IT1234567890
PL       PQR   PL123456
MY       XYZ   456792abc
IT       ABC   MY45889976
IT       ABC   IT567888976

**Regex df**
Country  Type  Regex
MY       ABC   ^MY[0-9]{10}
IT       ABC   ^IT[0-9]{10}
PL       PQR   ^PL
MY       XYZ   ^\w{6,10}$
I have tried merging them together and just using a lambda to do the matching. Below is my code:
df.merge(df_regex,left_on='Country',right_on="Country")
df['Data Quality'] = df.apply(lambda r:re.match(r['Regex'],r['Data']) and 1 or 0, axis=1)
But it adds another row for each different Type within the same Country, so there is a lot of duplication, which is inefficient and time consuming.
Is there any pythonic way to match each Data value against the regex for its Country and Type when the reference lives in another dataframe, without merging the two dataframes? The result should be 1 if the value matches its own regex, else 0.
To avoid the repetition based on Type, we should also include Type in the join conditions. Then apply the lambda:
df2 = df.merge(df_regex, left_on=['Country', 'Type'],right_on=['Country', 'Type'])
df2['Data Quality'] = df2.apply(lambda r:re.match(r['Regex'],r['Data']) and 1 or 0, axis=1)
df2
It will give you the following output.
Country Type Data Regex Data Quality
0 MY ABC MY1234567890 ^MY[0-9]{10} 1
1 IT ABC IT1234567890 ^IT[0-9]{10} 1
2 IT ABC MY45889976 ^IT[0-9]{10} 0
3 IT ABC IT567888976 ^IT[0-9]{10} 0
4 PL PQR PL123456 ^PL 1
5 MY XYZ 456792abc ^\w{6,10}$ 1
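The question also asks whether the match can be done without merging the two frames at all. One possible sketch (the sample frames below are rebuilt from the tables above, and whether this is actually faster than the merge would need measuring) is to turn df_regex into a (Country, Type) -> compiled-pattern lookup and check each row against it:

import re
import pandas as pd

# assumed sample frames, as in the question
df = pd.DataFrame({
    'Country': ['MY', 'IT', 'PL', 'MY', 'IT', 'IT'],
    'Type':    ['ABC', 'ABC', 'PQR', 'XYZ', 'ABC', 'ABC'],
    'Data':    ['MY1234567890', 'IT1234567890', 'PL123456',
                '456792abc', 'MY45889976', 'IT567888976'],
})
df_regex = pd.DataFrame({
    'Country': ['MY', 'IT', 'PL', 'MY'],
    'Type':    ['ABC', 'ABC', 'PQR', 'XYZ'],
    'Regex':   [r'^MY[0-9]{10}', r'^IT[0-9]{10}', r'^PL', r'^\w{6,10}$'],
})

# build the (Country, Type) -> compiled regex lookup once
patterns = {(r.Country, r.Type): re.compile(r.Regex)
            for r in df_regex.itertuples(index=False)}

# 1 if the row's Data matches its own regex, 0 otherwise (or if no regex exists)
df['Data Quality'] = [int(bool(patterns[(c, t)].match(d))) if (c, t) in patterns else 0
                      for c, t, d in zip(df['Country'], df['Type'], df['Data'])]
print(df)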
I have a dataset (in a dataframe) which looks as follows:
**_id** **paper_title** **references** **full_text**
1 XYZ [{'abc':'something','def':'something'},{'def':'something'},...many others] something
2 XYZ [{'abc':'something','def':'something'},{'def':'something'},...many others] something
3 XYZ [{'abc':'something'},{'def':'something'},...many others] something
Expected:
**_id** **paper_title** **abc** **def** **full_text**
1 XYZ something something something
something something
.
.
(all the dicts in the list for this _id)
2 XYZ something something something
something something
.
.
(all the dicts in the list for this _id)
I have tried df['column_name'].apply(pd.Series).apply(pd.Series) to split the list and the dictionaries into dataframe columns, but it doesn't help, as it doesn't split the dictionaries.
First row of my dataframe:
df.head(1)
Assuming the references column of your original DataFrame holds a list of dictionaries, each with a single key:value pair whose key is named 'reference':
print(df)
id paper_title references full_text
0 1 xyz [{'reference': 'description1'}, {'reference': ... some text
1 2 xyz [{'reference': 'descriptiona'}, {'reference': ... more text
2 3 xyz [{'reference': 'descriptioni'}, {'reference': ... even more text
Then you can use concat to separate out your references with their index:
df1 = pd.concat([pd.DataFrame(i) for i in df['references']], keys = df.index).reset_index(level=1,drop=True)
print(df1)
reference
0 description1
0 description2
0 description3
1 descriptiona
1 descriptionb
1 descriptionc
2 descriptioni
2 descriptionii
2 descriptioniii
Then use DataFrame.join to join the columns back together on their index:
df = df.drop('references', axis=1).join(df1).reset_index(drop=True)
print(df)
id paper_title full_text reference
0 1 xyz some text description1
1 1 xyz some text description2
2 1 xyz some text description3
3 2 xyz more text descriptiona
4 2 xyz more text descriptionb
5 2 xyz more text descriptionc
6 3 xyz even more text descriptioni
7 3 xyz even more text descriptionii
8 3 xyz even more text descriptioniii
After a lot of reading of the pandas documentation, I found that the explode method, combined with apply(pd.Series), is the easiest way to do what I was asking in the question.
Here is the code:
df = df.explode('references')
# explode turns each list element into its own row, repeating the other columns
df = df['references'].apply(pd.Series).merge(df, left_index=True, right_index=True, how='outer')
# split the dictionaries inside each cell into columns and merge them back with the original dataframe, like (A ∪ B) in set theory
Side note: while merging, check for unique values in the columns, as there will be many columns with duplicated values.
I hope this helps anyone with a DataFrame/Series column holding lists of multiple dictionaries who wants to split the dictionary keys into new columns, with the values as rows.
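For anyone who wants to try the explode route on a toy frame first, here is a small self-contained sketch; the sample values are made up, only the column names follow the question:

import pandas as pd

df = pd.DataFrame({
    '_id': [1, 2],
    'paper_title': ['XYZ', 'XYZ'],
    'references': [
        [{'abc': 'a1', 'def': 'd1'}, {'def': 'd2'}],
        [{'abc': 'a2'}, {'def': 'd3'}],
    ],
    'full_text': ['some text', 'more text'],
})

# one row per dictionary in the list; reset the index so the join below is one-to-one
exploded = df.explode('references').reset_index(drop=True)

# expand each dictionary's keys into columns (missing keys become NaN)
expanded = exploded['references'].apply(pd.Series)

# re-attach the remaining columns by index
result = exploded.drop(columns='references').join(expanded)
print(result)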
How can I slice column values based on first & last character location indicators from two other columns?
Here is the code for a sample df:
import pandas as pd
d = {'W': ['abcde','abcde','abcde','abcde']}
df = pd.DataFrame(data=d)
df['First']=[0,0,0,0]
df['Last']=[1,2,3,5]
df['Slice']=['a','ab','abc','abcde']
print(df.head())
Just do it with a for loop; if you are worried about speed, check For loops with pandas - When should I care?
df['Slice'] = [x[y:z] for x, y, z in zip(df.W, df.First, df.Last)]
df
Out[918]:
W First Last Slice
0 abcde 0 1 a
1 abcde 0 2 ab
2 abcde 0 3 abc
3 abcde 0 5 abcde
I am not sure if this will be faster, but a similar approach would be:
df['Slice'] = df.apply(lambda x: x[0][x[1]:x[2]],axis=1)
Briefly, you go through each row (axis=1) and apply a custom function. The function takes the row (stored as x), and slices the first element using the second and third elements as indices for the slicing (that's the lambda part). I will be happy to elaborate more if this isn't clear.
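If the positional lookups (x[0], x[1], x[2]) ever become a problem — newer pandas versions warn when integer keys are used positionally on a label-indexed Series — a label-based variant of the same idea would be (column names assumed from the sample above):

df['Slice'] = df.apply(lambda row: row['W'][row['First']:row['Last']], axis=1)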
I have a data frame df with columns "a" and "t", where column "a" has strings and column "t" has integers. I want to select the row pair(s) from the dataframe for which the values in column "a" are the same AND the difference between the values in column "t" is the smallest. For example:
df = a t
abc 4
abc 3
def 2
abc 1
I want to get the following result:
df = a t
abc 4
abc 3
I know we can use two for loops over the same data frame, but I am looking for a more efficient solution.
Thanks in anticipation
You may use:
df = df.sort_values(['a', 't'], ascending=False)
diff_ = df['t']-df['t'].shift(-1)
min_idx = diff_[df['a'] == df['a'].shift(-1)].idxmin()
df.loc[min_idx:min_idx+1]
Output:
a t
0 abc 4
1 abc 3
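For completeness, the same steps with comments, run on the sample data from the question:

import pandas as pd

df = pd.DataFrame({'a': ['abc', 'abc', 'def', 'abc'], 't': [4, 3, 2, 1]})

# sort so rows sharing the same 'a' become adjacent, largest 't' first
df = df.sort_values(['a', 't'], ascending=False)

# gap between each row's 't' and the next row's 't'
diff_ = df['t'] - df['t'].shift(-1)

# keep only gaps where the next row has the same 'a', then take the label of the smallest gap
min_idx = diff_[df['a'] == df['a'].shift(-1)].idxmin()

# label-based slice picks the winning row and its partner
# (works here because the paired rows end up adjacent after the sort)
print(df.loc[min_idx:min_idx + 1])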
I have the following problem: I want to create a column in a dataframe summarizing all values in a row. Then I want to compare the rows of that column to create a single row containing all the values from all columns, but so that each value is only present a single time. As an example, I have the following data frame
df1:
Column1 Column2
0 a 1,2,3
1 a 1,4,5
2 b 7,1,5
3 c 8,9
4 b 7,3,5
the desired output would now be:
df1_new:
Column1 Column2
0 a 1,2,3,4,5
1 b 1,3,5,7
2 c 8,9
What I am currently trying is result = df1.groupby('Column1'), but then I don't know how to compare the values in the rows of the grouped object, write them to the new column, and remove the duplicates. I read through the pandas documentation on Group By: split-apply-combine but could not figure out a way to do it. I also wonder whether, once I have my desired output, there is a way to check in how many of the lines of the grouped object each value in Column2 of df1_new appeared. Any help on this would be greatly appreciated!
A method by which you can do this would be to apply a function on the grouped DataFrame.
This function first converts the series (for each group) to a list, splits each string on ',', chains everything into a single flat list using itertools.chain.from_iterable, converts that to a set so only unique values are left, sorts it, and finally joins it back into a string with str.join. Example -
from itertools import chain

def applyfunc(x):
    ch = chain.from_iterable(y.split(',') for y in x.tolist())
    return ','.join(sorted(set(ch)))

df1_new = df1.groupby('Column1')['Column2'].apply(applyfunc).reset_index()
Demo -
In [46]: df
Out[46]:
Column1 Column2
0 a 1,2,3
1 a 1,4,5
2 b 7,1,5
3 c 8,9
4 b 7,3,5
In [47]: from itertools import chain
In [48]: def applyfunc(x):
....: ch = chain.from_iterable(y.split(',') for y in x.tolist())
....: return ','.join(sorted(set(ch)))
....:
In [49]: df.groupby('Column1')['Column2'].apply(applyfunc).reset_index()
Out[49]:
Column1 Column2
0 a 1,2,3,4,5
1 b 1,3,5,7
2 c 8,9
What about this:
df1
Column1 Column2
0 a 1,2,3
1 a 1,4,5
2 b 7,1,5
3 c 8,9
4 b 7,3,5
import numpy as np

df1.groupby('Column1').\
    agg(lambda x: ','.join(x).split(','))['Column2'].\
    apply(lambda x: ','.join(np.unique(x))).reset_index()
Column1 Column2
0 a 1,2,3,4,5
1 b 1,3,5,7
2 c 8,9
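The question also asks how to check in how many of a group's lines each value appeared. One possible sketch (column names taken from the sample above, and assuming a value occurs at most once per original row):

import pandas as pd

df1 = pd.DataFrame({
    'Column1': ['a', 'a', 'b', 'c', 'b'],
    'Column2': ['1,2,3', '1,4,5', '7,1,5', '8,9', '7,3,5'],
})

# one row per individual value, keeping the group label
long = df1.assign(Column2=df1['Column2'].str.split(',')).explode('Column2')

# number of rows of each group that contained each value
counts = long.groupby(['Column1', 'Column2']).size()
print(counts)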