Drop rows in pandas if they contain "???" - python

I'm trying to drop rows in pandas that contain "???". It works for every other value except "???", and I do not know what the problem is.
This is my code (I have tried both versions):
df = df[~df["text"].str.contains("?????", na=False)]
df = df[~df["text"].str.contains("?????")]
The error that I'm getting:
re.error: nothing to repeat at position 0
It works for every other value except "???".
I have googled it and looked all over this website, but I couldn't find any solutions.

The pat parameter of str.contains is treated as a regular expression by default, hence the re.error.
You can either escape the ? inside the expression (using a raw string) like this:
df = df[~df["text"].str.contains(r"\?\?\?\?\?")]
Or set regex=False, as Vorsprung suggested:
df = df[~df["text"].str.contains("?????", regex=False)]

Let's convert this into running code:
import pandas as pd
data = {'A': ['abc', 'cxx???xx', '???'], 'B': ['add', 'ddb', 'c']}
df = pd.DataFrame.from_dict(data)
df
output:
          A    B
0       abc  add
1  cxx???xx  ddb
2       ???    c
With this:
df[df['A'].str.contains('???', regex=False)]
output:
          A    B
1  cxx???xx  ddb
2       ???    c
You need to tell contains() that your search string is not a regex.
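And to actually drop those rows, as the original question asks, negate the mask with ~:
df = df[~df['A'].str.contains('???', regex=False)]
output:
     A    B
0  abc  add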

Related

How to select all observations whose name starts with a specific element in python

I have a dataframe where I want to create a Dummy variable that takes the value 1 when the Asset Class starts with a D. I want to have all variants that start with a D. How would you do it?
The data looks like
dic = {'Asset Class': ['D.1', 'D.12', 'D.34','nan', 'F.3', 'G.12', 'D.2', 'nan']}
df = pd.DataFrame(dic)
What I want to have is
dic_want = {'Asset Class': ['D.1', 'D.12', 'D.34', 'nan', 'F.3', 'G.12', 'D.2', 'nan'],
            'Asset Dummy': [1, 1, 1, 0, 0, 0, 1, 0]}
df_want = pd.DataFrame(dic_want)
I tried
df_want["Asset Dummy"] = ((df["Asset Class"] == df.filter(like="D"))).astype(int)
where I get the following error message: ValueError: Columns must be same length as key
I also tried
CSDB["test"] = ((CSDB["PAC2"] == CSDB.str.startswith('D'))).astype(int)
where I get the error message AttributeError: 'DataFrame' object has no attribute 'str'.
I tried to transform my object to a string with the standard methods (astype(str) and to_string()), but that also does not work. This is probably a separate problem, but I have found only one post with the same question, and it does not have a satisfactory answer.
Any ideas how I can solve my problem?
There are many ways to create a new column based on conditions; this is one of them:
import pandas as pd
import numpy as np
dic = {'Asset Class': ['D.1', 'D.12', 'D.34', 'F.3', 'G.12', 'D.2']}
df = pd.DataFrame(dic)
# np.where picks 1 where the condition holds, else 0
# (note: str.contains matches a "D" anywhere; use str.startswith to match only the first character)
df['Dummy'] = np.where(df['Asset Class'].str.contains("D"), 1, 0)
Here's a link to more: https://www.dataquest.io/blog/tutorial-add-column-pandas-dataframe-based-on-if-else-condition/
You can use Series.str.startswith on df['Asset Class']:
>>> dic = {'Asset Class': ['D.1', 'D.12', 'D.34', 'nan', 'F.3', 'G.12', 'D.2', 'nan']}
>>> df = pd.DataFrame(dic)
>>> df['Asset Dummy'] = df['Asset Class'].str.startswith('D').astype(int)
>>> df
  Asset Class  Asset Dummy
0         D.1            1
1        D.12            1
2        D.34            1
3         nan            0
4         F.3            0
5        G.12            0
6         D.2            1
7         nan            0
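One caveat not covered above: if the column holds real NaN values rather than the string 'nan', str.startswith returns NaN for those rows and .astype(int) raises. The na parameter handles that; a minimal sketch:
import numpy as np
import pandas as pd
df = pd.DataFrame({'Asset Class': ['D.1', np.nan, 'F.3']})
# na=False treats missing values as "does not start with D"
df['Asset Dummy'] = df['Asset Class'].str.startswith('D', na=False).astype(int)
print(df)
output:
  Asset Class  Asset Dummy
0         D.1            1
1         NaN            0
2         F.3            0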

Normalize json column and join with rest of dataframe

This is my first question here on Stack Overflow, so please don't roast me.
I tried to find similar problems on the internet, and there are actually several, but their solutions didn't work for me.
I have created this dataframe:
import pandas as pd
from ast import literal_eval
d = {'order_id': [1], 'email': ["hi@test.com"], 'line_items': ["[{'sku':'testproduct1', 'quantity':'2'},{'sku':'testproduct2','quantity':'2'}]"]}
orders = pd.DataFrame(data=d)
It looks like this:
   order_id        email                                                                       line_items
0         1  hi@test.com  [{'sku':'testproduct1', 'quantity':'2'},{'sku':'testproduct2','quantity':'2'}]
I want the dataframe to look like this:
   order_id        email line_items.sku line_items.quantity
0         1  hi@test.com   testproduct1                   2
1         1  hi@test.com   testproduct2                   2
I used the following code to change the type of line_items from string to dict:
orders.line_items = orders.line_items.apply(literal_eval)
Normally I would now use json_normalize to flatten the line_items column, but I also want to keep the order_id and don't know how to do that. I also want to avoid any loops.
Is there anyone who can help me with this issue?
Kind regards
joant95
If your dictionary really is that strange, then you could try:
d['line_items'] = eval(d['line_items'][0])
df = pd.json_normalize(d, record_path=['line_items'], meta=['order_id', 'email'])
To create d out of orders you could try:
d = orders.to_dict(orient='list')
Or you could try:
orders.line_items = orders.line_items.map(eval)
d = orders.to_dict(orient='records')
df = pd.json_normalize(d, record_path=['line_items'], meta=['order_id', 'email'])
But: I still don't have a clear picture of the situation :)
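Putting the pieces together, a minimal runnable sketch of the second variant (using literal_eval instead of eval, since the strings are plain Python literals and literal_eval is safer):
import pandas as pd
from ast import literal_eval
d = {'order_id': [1], 'email': ["hi@test.com"], 'line_items': ["[{'sku':'testproduct1', 'quantity':'2'},{'sku':'testproduct2','quantity':'2'}]"]}
orders = pd.DataFrame(data=d)
# parse the stringified list of dicts safely, then flatten it
orders.line_items = orders.line_items.map(literal_eval)
records = orders.to_dict(orient='records')
df = pd.json_normalize(records, record_path=['line_items'], meta=['order_id', 'email'])
print(df)
output:
            sku quantity  order_id        email
0  testproduct1        2         1  hi@test.com
1  testproduct2        2         1  hi@test.com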

How to cast variable in .query() function to lower case?

I have a df where I want to query while using values from itertuples() from another dataframe:
matching_group = Pandas(Index=0, match='TEST')
df.query("column_a == @matching_group.match")
"matching_group" is coming from df.itertuples() and I can't change that data type. The query above works.
But now I need to cast "matching_group.match" to lowercase.
matching_group = Pandas(Index=0, match='TEST')
df.query("column_a == @matching_group.match.lower()")
This does not work.
It's hard to create a minimal viable example here.
How can I cast a variable used via @ in a df.query() to lowercase?
Your code works well for me with named tuples; one possible reason for it not matching is trailing whitespace, which you can remove with strip:
df = pd.DataFrame({'column_a': ['test', 'tesT', 'No']})
from collections import namedtuple
Pandas = namedtuple('Pandas', 'Index match')
matching_group = Pandas(Index=0, match='TEST')
print (matching_group)
Pandas(Index=0, match='TEST')
df3 = df.query("column_a == @matching_group.match.lower()")
print (df3)
  column_a
0     test
df3 = df.query("column_a.str.strip() == @matching_group.match.lower().strip()")
Input Toy Example
df = pd.DataFrame({
    'test': ['abc', 'DEF'],
    'num': [1, 2]
})
val = 'Abc'  # variable to be matched
Input df
  test  num
0  abc    1
1  DEF    2
Code
df.query('test == @val.lower()')
Output
  test  num
0  abc    1
Tested on pandas version
pd.__version__  # '1.2.4'
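If the goal is a fully case-insensitive match (lower-casing both sides rather than only the variable), one option is to lower-case the column inside the query as well; a sketch, with engine='python' passed explicitly so query accepts the .str accessor:
import pandas as pd
df = pd.DataFrame({'test': ['abc', 'DEF'], 'num': [1, 2]})
val = 'def'  # hypothetical variable to match regardless of case
out = df.query("test.str.lower() == @val.lower()", engine='python')
print(out)
output:
  test  num
1  DEF    2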

How do you sort only the first column in a csv file and not affect any of the other columns while doing so using python

I am working on a table (CSV file) that has the following data:
roll_no,student_name,grade,email_id
1,Aarav Gosalia,Grade 11,aarav.gosalia@flag.org.in
2,Aarav Rawal,Grade 11,aarav.rawal@flag.org.in
3,Abizar Chitalwala,Grade 11,abizar.chitalwala@flag.org.in
4,Ahad Motorwala,Grade 11,ahad.motorwala@flag.org.in
5,Armaan Adenwala,Grade 11,armaan.adenwala@flag.org.in
6,Aryan Shah,Grade 11,aryan.shah@flag.org.in
7,Baasit Motorwala,Grade 11,baasit.motorwala@flag.org.in
16,Caroline Walker,Grade 11,caroline.walker@flag.org.in
8,Darsshan Kavedia,Grade 11,darsshan.kavedia@flag.org.in
9,Devanshi Rajgharia,Grade 11,devanshi.rajgharia@flag.org.in
10,Dhruv Jain,Grade 11,dhruv.jain@flag.org.in
11,Eisa Patel,Grade 11,eisa.patel@flag.org.in
12,Esha Khimawat,Grade 11,esha.khimawat@flag.org.in
13,Fatima Unwala,Grade 11,fatima.unwala@flag.org.in
14,Hamza Erfan,Grade 11,hamza.erfan@flag.org.in
15,Harsh Gosar,Grade 11,harsh.gosar@flag.org.in
As you can see, all of the names are sorted, but the roll number of Caroline Walker is 16. I want a way to sort only the roll numbers without affecting any of the other columns.
I want the final table to look like this:
roll_no,student_name,grade,email_id
1,Aarav Gosalia,Grade 11,aarav.gosalia@flag.org.in
2,Aarav Rawal,Grade 11,aarav.rawal@flag.org.in
3,Abizar Chitalwala,Grade 11,abizar.chitalwala@flag.org.in
4,Ahad Motorwala,Grade 11,ahad.motorwala@flag.org.in
5,Armaan Adenwala,Grade 11,armaan.adenwala@flag.org.in
6,Aryan Shah,Grade 11,aryan.shah@flag.org.in
7,Baasit Motorwala,Grade 11,baasit.motorwala@flag.org.in
8,Caroline Walker,Grade 11,caroline.walker@flag.org.in
9,Darsshan Kavedia,Grade 11,darsshan.kavedia@flag.org.in
10,Devanshi Rajgharia,Grade 11,devanshi.rajgharia@flag.org.in
11,Dhruv Jain,Grade 11,dhruv.jain@flag.org.in
12,Eisa Patel,Grade 11,eisa.patel@flag.org.in
13,Esha Khimawat,Grade 11,esha.khimawat@flag.org.in
14,Fatima Unwala,Grade 11,fatima.unwala@flag.org.in
15,Hamza Erfan,Grade 11,hamza.erfan@flag.org.in
16,Harsh Gosar,Grade 11,harsh.gosar@flag.org.in
Please help me, and keep in mind that I am still a beginner in Python.
Just use pandas: df['roll_no'] = range(1, len(df) + 1)
Pandas is one way:
import pandas as pd
df = pd.read_csv('file.csv')
# sort_values returns a new DataFrame, so assign it back
# (note: this reorders whole rows; to renumber only the roll_no column, see the next answer)
df = df.sort_values(by='roll_no')
df.to_csv('file.csv', index=False)
This will work:
df = pd.DataFrame({'id': [1, 3, 2, 7],
                   'name': ['M', 'r', 'd', 'd']})
# sort only the id values; the other columns keep their row order
df['id'] = list(df['id'].sort_values())
df
Result:
   id name
0   1    M
1   2    r
2   3    d
3   7    d
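Applying the same idea to the original file, a minimal sketch (assuming the CSV is named file.csv, as above):
import pandas as pd
df = pd.read_csv('file.csv')
# sort only the roll_no values; every other column keeps its row order
df['roll_no'] = sorted(df['roll_no'])
df.to_csv('file.csv', index=False)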

python pandas - function applied to csv is not persisted

I need to polish a CSV dataset, but it seems the changes are not applied to the dataset itself.
CSV is in this format:
ID, TRACK_LINK
761607, https://mylink.com//track/...
This is my script:
import pandas as pd
df = pd.read_csv('./file.csv').fillna('')
# remove double // from TRACK_LINK
def polish_track_link(track_link):
    return track_link.replace("//track", "/track")
df['TRACK_LINK'].apply(polish_track_link)
print(df)
This prints something like:
...
761607 https://mylink.com//track/...
Note the //track.
If I do print(df['TRACK_LINK'].apply(polish_track_link)) I get:
...
761607, https://mylink.com/track/...
So the function polish_track_link works, but it's not applied to the dataset. Any idea why?
You need to assign it back:
df['TRACK_LINK'] = df['TRACK_LINK'].apply(polish_track_link)
But it is better to use the pandas functions str.replace, or replace with regex=True, to replace substrings:
df['TRACK_LINK'] = df['TRACK_LINK'].str.replace("//track", "/track")
Or:
df['TRACK_LINK'] = df['TRACK_LINK'].replace("//track", "/track", regex=True)
print(df)
       ID                 TRACK_LINK
0  761607  https://mylink.com/track/
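To see why the assignment matters, a small sketch: apply (and str.replace) return a new Series and leave the original column untouched until you assign it back.
import pandas as pd
df = pd.DataFrame({'TRACK_LINK': ['https://mylink.com//track/abc']})
fixed = df['TRACK_LINK'].str.replace('//track', '/track')
print(df['TRACK_LINK'].iloc[0])  # still https://mylink.com//track/abc
print(fixed.iloc[0])             # https://mylink.com/track/abc
df['TRACK_LINK'] = fixed         # now the change is persisted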
