pandas select row if value in another column changes - python

Let's say I have a large data set that follows a similar structure:
where the id repeats multiple times. I would like to select any id where the value in column b changed with the desired output as such:
How might I be able to achieve that via pandas?

It is not entirely clear what you are asking for. You say
I would like to select any id where the value in column b changed
but 'changed' from what?
Perhaps the following can be helpful -- it will show you all unique ColumnB strings for each id
Using a sample df:
import pandas as pd
df = pd.DataFrame({'id':[1,1,1,2,2,2,2], 'colb':['a','a','b','c','d','c','c']})
we use groupby and unique:
df.groupby('id')['colb'].unique().explode().to_frame()
output:
colb
id
1 a
1 b
2 c
2 d
so for id=1 we have a and b as unique phrases, and for id=2 we have c and d
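If "changed" means that an id has more than one distinct value in colb, the unique-values idea above can be turned into a filter. The following is only a sketch under that assumption; the extra id=3 rows are added here for illustration and are not part of the original sample.

```python
import pandas as pd

# Sample data; id=3 is a hypothetical extra group whose colb never changes.
df = pd.DataFrame({
    'id':   [1, 1, 1, 2, 2, 2, 2, 3, 3],
    'colb': ['a', 'a', 'b', 'c', 'd', 'c', 'c', 'e', 'e'],
})

# Keep every row whose id has more than one distinct colb value.
has_change = df.groupby('id')['colb'].transform('nunique') > 1
result = df[has_change]

# Alternative reading: keep only the rows where colb differs from the
# previous row of the same id (the first row of each id is never a change).
prev = df.groupby('id')['colb'].shift()
change_rows = df[prev.notna() & (df['colb'] != prev)]

print(result)
print(change_rows)
```

Which of the two readings is right depends on what "changed" means in the question, which the original post does not specify.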

Related

Ordering by multiple columns including Count in PySpark

I am currently working on understanding Pyspark, and am running into a problem. I am attempting to resolve how to order by multiple columns in the dataframe, when one of these is a count.
As an example, say I have a dataframe (df) with three columns, A,B,and C. I want to group by A and B, and then count these instances. So if there are 10 instances where A=1 and B=1, the Table for that row should look like:
A|B|Count
1 1 10
I have determined that I can do this fairly easily by running:
df.groupBy('A', 'B').count()
Then if I want to order this dataframe by count (descending), this is also pretty straightforward:
df.groupBy('A', 'B').count().orderBy(desc("count"))
This next step is where I am having trouble. What if now I want to also order by column C, ie order first by count, and then by C? I had thought that the syntax would be something akin to:
df.groupBy('A', 'B').count().orderBy(desc("count"), desc("C"))
But this does not work, presumably because once I run count(), the dataframe is limited to only the columns A, B, and count. Do I need to somehow create a new column in the original dataframe with the count column, and if so, how can I do this?
Is there another simpler way that I am missing to order by both count and C?
For clarity an example dataframe that I would like to end with could appear as:
A|B|Count|C
1 1 10 5
1 2 9 3
1 5 9 1
2 4 8 10
2 7 8 5
Any insights or guidance are greatly appreciated.
Try using a window function. Column 'C' is not part of the groupBy, so it is no longer available for ordering once count() has run. A window function adds the count as a new column while keeping all original columns (one row per original row), so you can order by both. If you only want the grouped columns (e.g. A and B) and the count column, you can add a select for just those afterwards.
from pyspark.sql import functions as F
from pyspark.sql.window import Window
windowSpec = Window.partitionBy("A","B")
df.withColumn('count',F.count('*').over(windowSpec)).orderBy(F.col('count').desc(),F.col('C').desc()).show()

filtering pandas dataframe when data contains two parts

I have a pandas dataframe and want to filter down to all the rows that contain a certain criteria in the “Title” column.
The rows I want to filter down to are all rows that contain the format “(Axx)” (Where xx are 2 numbers).
The data in the “Title” column doesn’t just consist of “(Axx)” data.
The data in the “Title” column looks like so:
“some_string (Axx)”
I've been playing around with different methods but can't seem to get it.
I think the closest I've gotten is:
df.filter(regex=r'(D\d{2})', axis=0))
but it's not correct, as the entries aren't being filtered.
Use Series.str.contains with escaped parentheses, \( and \), and $ for the end of the string, then filter with boolean indexing:
df = pd.DataFrame({'Title':['(D89)','aaa (D71)','(D5)','(D78) aa','D72']})
print (df)
Title
0 (D89)
1 aaa (D71)
2 (D5)
3 (D78) aa
4 D72
df1 = df[df['Title'].str.contains(r'\(D\d{2}\)$')]
print (df1)
Title
0 (D89)
1 aaa (D71)
If you need to match only (Dxx), use Series.str.match:
df2 = df[df['Title'].str.match(r'\(D\d{2}\)$')]
print (df2)
Title
0 (D89)
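The question asks for "(Axx)" while the answer's sample uses "(Dxx)", so here is a small sketch of the same str.contains approach with the letter A and without the $ anchor, so that the code may appear anywhere in the title. The sample titles are made up for illustration.

```python
import pandas as pd

# Hypothetical titles; only two of them contain "(A" + two digits + ")".
df = pd.DataFrame({'Title': [
    'some_string (A12)',
    '(A7) short',
    'no code here',
    'prefix (A34) suffix',
    'A56',
]})

# Escape the parentheses so they are matched literally, not as a group.
pattern = r'\(A\d{2}\)'

# na=False treats missing titles as non-matching instead of propagating NaN.
mask = df['Title'].str.contains(pattern, regex=True, na=False)
filtered = df[mask]
print(filtered)
```

Appending $ to the pattern would restrict the match to titles that end with the code, as in the answer above.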

Pandas group by gender shows more than two groups

I have a dataframe that shows each audience's ranking for a bunch of movies. I wanted to make a list of movies with the most ratings, for each gender.
Here's what I did:
most_rated_gen=lens.groupby(['sex','title']).size().sort_values(ascending=False).to_frame()
I was expecting to see a dataframe that looks something like this:
sex | title
M A
B
C
D
F B
C
D
A
Instead, I got this:
I don't know why it shows M F M F M. Any ideas how I could fix this?
You can use nlargest() if your aggregated column has a name. Assuming the column name is ratings_count, you can use this code:
most_rated_gen.groupby(['sex'])['ratings_count'].nlargest()
As you group by sex, the output will contain the sex column.
You have a shortcut for your operation with value_counts:
df.value_counts(['sex', 'title']).sort_index(kind='mergesort')
If you want your data to be sorted by index while preserving the order of values then you have to use sort_index with kind='mergesort' as parameter.
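As an alternative sketch, not taken from either answer, the counts can be put into a regular column and sorted by sex first and count second, which keeps each gender's titles together. The lens data below is invented for illustration, and ratings_count is the column name assumed in the first answer.

```python
import pandas as pd

# Hypothetical ratings: one row per rating, as in the question's lens frame.
lens = pd.DataFrame({
    'sex':   ['M', 'M', 'M', 'M', 'M', 'M', 'F', 'F', 'F'],
    'title': ['A', 'A', 'A', 'B', 'B', 'C', 'B', 'B', 'A'],
})

# Count ratings per (sex, title) and give the count column a name.
counts = (
    lens.groupby(['sex', 'title'])
        .size()
        .reset_index(name='ratings_count')
)

# Sort by sex, then by count descending within each sex.
most_rated_gen = counts.sort_values(
    ['sex', 'ratings_count'], ascending=[True, False]
)

# Optionally keep only the top 2 titles per sex.
top2 = most_rated_gen.groupby('sex').head(2)
print(most_rated_gen)
```

Sorting only by the count, as in the question, interleaves the genders because the sort ignores the sex level entirely.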

find string in list of pandas dataframe by part of string

I need to find an element in each row's list based on part of the element, e.g. a-1 should be found by searching for a.
str.contains works if column1 holds a single string, but it does not work when column1 holds a list.
Original Data Frame:
Desired result:
Here is the code I tried:
frame = pd.DataFrame({'column1' : [['a-1','b-1','c-1'], ['a-2','b-2','c-2'], ['a-3','b-3','c-3']]})
frame['column1']=frame[frame['column1'].str.contains('a')]
If the order of the list doesn't change, you can try something like
frame['column1'] = frame['column1'].str[0]
frame
Output
column1
0 a-1
1 a-2
2 a-3
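If the position of the wanted element can vary between rows, one possible sketch is to search each list for the first element containing the substring. The helper name first_match and the shuffled sample lists are assumptions made for this illustration.

```python
import pandas as pd

# Same shape as the question's frame, but with the 'a-…' element
# in a different position in each row.
frame = pd.DataFrame({'column1': [
    ['a-1', 'b-1', 'c-1'],
    ['b-2', 'a-2', 'c-2'],
    ['c-3', 'b-3', 'a-3'],
]})

def first_match(items, part):
    """Return the first element of items that contains part, else None."""
    return next((item for item in items if part in item), None)

# Extra keyword arguments to Series.apply are passed on to the function.
frame['match'] = frame['column1'].apply(first_match, part='a')
print(frame['match'])
```

The .str[0] approach above is simpler and faster when the element is always first; this version trades speed for not depending on the order.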

Adding new rows in pandas dataframe at specific index

I have read all the answers related to my question on Stack Overflow, but my question is a little different from them. I have a very large dataframe, and a portion of it is shown below.
Input Dataframe is like
A B C D
0 foot 17/1: OGChan_2020011717711829281829281 , 7days ...
1 arm this will processed after ;;;
2 leg go_2020011625692400374400374 16/1: Id Imerys_2020011618188744093744093
3 head xyziemen_2020011510691787006787006 en_2020011510749462801462801 ;;;
: : : :
In this dataframe, I first extract IDs from column B using a regular expression. Some rows of column B contain IDs, some do not, and some are blank. The code is as follows:
import re
import pandas as pd

df = pd.read_excel("Book1.xlsx", "Sheet1")
ids_by_row = {}  # row index -> list of IDs found in column B
for i in df.index:
    j = str(df['B'][i])
    a = re.findall(r'_\d{25}', j)
    if a:
        print(a)
        ids_by_row[i] = a
The regular expression matches an underscore followed by 25 digits. Examples in the dataframe above are _2020011618188744093744093, _2020011510749462801462801, etc.
Now I want to insert these IDs into column D of the corresponding row. For example, if two IDs are found in the 0th row, the first ID should be inserted in the 0th row of column D and the second ID in the 1st row of column D, with all the following content of the dataframe shifted down. The output below, based on the input above, should make this clear.
A B .. D
0 foot 17/1: OGChan_2020011717711829281829281 ,7days _2020011717711829281829281
1 arm this will processed after
2 leg go_2020011625692400374400374 16/1: _2020011625692400374400374
Id Imerys_2020011618188744093744093
3 _2020011618188744093744093
4 head xyziemen_2020011510691787006787006 _2020011510691787006787006
en_2020011510749462801462801
5 _2020011510749462801462801
: : : :
In the output above, one ID is found in the 0th row, so column D of the 0th row contains that ID. No ID is found at index 1, so column D there is empty. At index 2 there are two IDs, so the first ID is placed in row 2 of column D and the second in row 3, which shifts the previous content of row 3 down to row 4. This is the final output I want.
I hope that is clear. Thanks in advance.
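One way this might be approached, offered only as a sketch, is to collect the IDs of each row into a list and let DataFrame.explode create the extra rows. The small frame below reuses three rows from the question; blanking columns A and B on the added rows is an assumption about the desired layout.

```python
import re
import pandas as pd

# Three rows taken from the question's sample input.
df = pd.DataFrame({
    'A': ['foot', 'arm', 'leg'],
    'B': [
        '17/1: OGChan_2020011717711829281829281 , 7days',
        'this will processed after',
        'go_2020011625692400374400374 16/1: Id Imerys_2020011618188744093744093',
    ],
})

# Column D holds the list of IDs (underscore + 25 digits) found in column B.
df['D'] = df['B'].astype(str).apply(lambda s: re.findall(r'_\d{25}', s))

# explode gives one row per ID; a row with no IDs keeps one row with NaN in D.
out = df.explode('D')

# Rows created for the second, third, ... ID repeat the original index label.
is_extra = out.index.duplicated(keep='first')
out = out.reset_index(drop=True)

# Assumption: the added rows should show only the ID, so blank out A and B.
out.loc[is_extra, ['A', 'B']] = ''
out['D'] = out['D'].fillna('')
print(out)
```

Because explode inserts the new rows directly below their source row, all following rows shift down without any manual index arithmetic.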
