Drop rows in pandas if they contain "???" - python

I'm trying to drop rows in pandas that contain "???". It works for every other value except "???", and I do not know what the problem is.
This is my code (I have tried both versions):
df = df[~df["text"].str.contains("?????", na=False)]
df = df[~df["text"].str.contains("?????")]
The error that I'm getting:
re.error: nothing to repeat at position 0
It works for every other value except "???".
I have googled it and looked all over this website, but I couldn't find any solutions.

The pat parameter of str.contains is treated as a regular expression by default, hence the re.error.
You can either escape the ? inside the expression (using a raw string) like this:
df = df[~df["text"].str.contains(r"\?\?\?\?\?")]
Or set regex=False, as Vorsprung suggested:
df = df[~df["text"].str.contains("?????", regex=False)]

Let's convert this into running code:
import pandas as pd
data = {'A': ['abc', 'cxx???xx', '???'], 'B': ['add', 'ddb', 'c']}
df = pd.DataFrame.from_dict(data)
df
output:
          A    B
0       abc  add
1  cxx???xx  ddb
2       ???    c
With this:
df[df['A'].str.contains('???', regex=False)]
output:
          A    B
1  cxx???xx  ddb
2       ???    c
You need to tell contains() that your search string is not a regex.
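And to actually drop those rows, as the original question asks, negate the mask with ~:
df = df[~df['A'].str.contains('???', regex=False)]
output:
     A    B
0  abc  add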

Related

How to select all observations whose name starts with a specific element in python

I have a dataframe where I want to create a Dummy variable that takes the value 1 when the Asset Class starts with a D. I want to have all variants that start with a D. How would you do it?
The data looks like
dic = {'Asset Class': ['D.1', 'D.12', 'D.34','nan', 'F.3', 'G.12', 'D.2', 'nan']}
df = pd.DataFrame(dic)
What I want to have is
dic_want = {'Asset Class': ['D.1', 'D.12', 'D.34', 'nan', 'F.3', 'G.12', 'D.2', 'nan'],
            'Asset Dummy': [1, 1, 1, 0, 0, 0, 1, 0]}
df_want = pd.DataFrame(dic_want)
I tried
df_want["Asset Dummy"] = ((df["Asset Class"] == df.filter(like="D"))).astype(int)
where I get the following error message: ValueError: Columns must be same length as key
I also tried
CSDB["test"] = ((CSDB["PAC2"] == CSDB.str.startswith('D'))).astype(int)
where I get the error message AttributeError: 'DataFrame' object has no attribute 'str'.
I tried to transform my object to a string with the standard methods (astype(str) and to_string()), but that also does not work. This is probably a separate problem, but I have found only one post with the same question, and it does not have a satisfactory answer.
Any ideas how I can solve my problem?
There are many ways to create a new column based on conditions; this is one of them:
import pandas as pd
import numpy as np
dic = {'Asset Class': ['D.1', 'D.12', 'D.34', 'F.3', 'G.12', 'D.2']}
df = pd.DataFrame(dic)
# np.where picks 1 where the condition holds, else 0
# (note: str.contains matches a "D" anywhere; use str.startswith to match only the first character)
df['Dummy'] = np.where(df['Asset Class'].str.contains("D"), 1, 0)
Here's a link to more: https://www.dataquest.io/blog/tutorial-add-column-pandas-dataframe-based-on-if-else-condition/
You can use Series.str.startswith on df['Asset Class']:
>>> dic = {'Asset Class': ['D.1', 'D.12', 'D.34', 'nan', 'F.3', 'G.12', 'D.2', 'nan']}
>>> df = pd.DataFrame(dic)
>>> df['Asset Dummy'] = df['Asset Class'].str.startswith('D').astype(int)
>>> df
  Asset Class  Asset Dummy
0         D.1            1
1        D.12            1
2        D.34            1
3         nan            0
4         F.3            0
5        G.12            0
6         D.2            1
7         nan            0
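One caveat not covered above: if the column holds real NaN values rather than the string 'nan', str.startswith returns NaN for those rows and .astype(int) raises. The na parameter handles that; a minimal sketch:
import numpy as np
import pandas as pd
df = pd.DataFrame({'Asset Class': ['D.1', np.nan, 'F.3']})
# na=False treats missing values as "does not start with D"
df['Asset Dummy'] = df['Asset Class'].str.startswith('D', na=False).astype(int)
print(df)
output:
  Asset Class  Asset Dummy
0         D.1            1
1         NaN            0
2         F.3            0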

Normalize json column and join with rest of dataframe

This is my first question here on Stack Overflow, so please don't roast me.
I tried to find similar problems on the internet, and there are actually several, but their solutions didn't work for me.
I have created this dataframe:
import pandas as pd
from ast import literal_eval
d = {'order_id': [1], 'email': ["hi@test.com"], 'line_items': ["[{'sku':'testproduct1', 'quantity':'2'},{'sku':'testproduct2','quantity':'2'}]"]}
orders = pd.DataFrame(data=d)
It looks like this:
   order_id        email                                                                       line_items
0         1  hi@test.com  [{'sku':'testproduct1', 'quantity':'2'},{'sku':'testproduct2','quantity':'2'}]
I want the dataframe to look like this:
   order_id        email line_items.sku line_items.quantity
0         1  hi@test.com   testproduct1                   2
1         1  hi@test.com   testproduct2                   2
I used the following code to change the type of line_items from string to dict:
orders.line_items = orders.line_items.apply(literal_eval)
Normally I would now use json_normalize to flatten the line_items column, but I also want to keep the order_id and don't know how to do that. I also want to avoid any loops.
Is there anyone who can help me with this issue?
Kind regards
joant95
If your dictionary really is that strange, then you could try:
d['line_items'] = eval(d['line_items'][0])
df = pd.json_normalize(d, record_path=['line_items'], meta=['order_id', 'email'])
To create d out of orders you could try:
d = orders.to_dict(orient='list')
Or you could try:
orders.line_items = orders.line_items.map(eval)
d = orders.to_dict(orient='records')
df = pd.json_normalize(d, record_path=['line_items'], meta=['order_id', 'email'])
But: I still don't have a clear picture of the situation :)
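Putting the pieces together, a minimal runnable sketch of the second variant (using literal_eval instead of eval, since the strings are plain Python literals and literal_eval is safer):
import pandas as pd
from ast import literal_eval
d = {'order_id': [1], 'email': ["hi@test.com"], 'line_items': ["[{'sku':'testproduct1', 'quantity':'2'},{'sku':'testproduct2','quantity':'2'}]"]}
orders = pd.DataFrame(data=d)
# parse the stringified list of dicts safely, then flatten it
orders.line_items = orders.line_items.map(literal_eval)
records = orders.to_dict(orient='records')
df = pd.json_normalize(records, record_path=['line_items'], meta=['order_id', 'email'])
print(df)
output:
            sku quantity  order_id        email
0  testproduct1        2         1  hi@test.com
1  testproduct2        2         1  hi@test.com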

How to cast variable in .query() function to lower case?

I have a df where I want to query while using values from itertuples() from another dataframe:
matching_group = Pandas(Index=0, match='TEST')
df.query("column_a == @matching_group.match")
"matching_group" is coming from df.itertuples() and I can't change that data type. The query above works.
But now I need to cast "matching_group.match" to lowercase.
matching_group = Pandas(Index=0, match='TEST')
df.query("column_a == @matching_group.match.lower()")
This does not work.
It's hard to create a minimal viable example here.
How can I cast a variable used via @ in a df.query() to lowercase?
Your code works well for me with named tuples; one possible reason for it not matching is trailing whitespace, which you can remove with strip:
df = pd.DataFrame({'column_a': ['test', 'tesT', 'No']})
from collections import namedtuple
Pandas = namedtuple('Pandas', 'Index match')
matching_group = Pandas(Index=0, match='TEST')
print (matching_group)
Pandas(Index=0, match='TEST')
df3 = df.query("column_a == @matching_group.match.lower()")
print (df3)
  column_a
0     test
df3 = df.query("column_a.str.strip() == @matching_group.match.lower().strip()")
Input Toy Example
df = pd.DataFrame({
    'test': ['abc', 'DEF'],
    'num': [1, 2]
})
val = 'Abc'  # variable to be matched
Input df
  test  num
0  abc    1
1  DEF    2
Code
df.query('test == @val.lower()')
Output
  test  num
0  abc    1
Tested on pandas version
pd.__version__  # '1.2.4'
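If the goal is a fully case-insensitive match (lower-casing both sides rather than only the variable), one option is to lower-case the column inside the query as well; a sketch, with engine='python' passed explicitly so query accepts the .str accessor:
import pandas as pd
df = pd.DataFrame({'test': ['abc', 'DEF'], 'num': [1, 2]})
val = 'def'  # hypothetical variable to match regardless of case
out = df.query("test.str.lower() == @val.lower()", engine='python')
print(out)
output:
  test  num
1  DEF    2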

How do you sort only the first column in a csv file and not affect any of the other columns while doing so using python

I am working on a table (CSV file) that has the following data:
roll_no,student_name,grade,email_id
1,Aarav Gosalia,Grade 11,aarav.gosalia@flag.org.in
2,Aarav Rawal,Grade 11,aarav.rawal@flag.org.in
3,Abizar Chitalwala,Grade 11,abizar.chitalwala@flag.org.in
4,Ahad Motorwala,Grade 11,ahad.motorwala@flag.org.in
5,Armaan Adenwala,Grade 11,armaan.adenwala@flag.org.in
6,Aryan Shah,Grade 11,aryan.shah@flag.org.in
7,Baasit Motorwala,Grade 11,baasit.motorwala@flag.org.in
16,Caroline Walker,Grade 11,caroline.walker@flag.org.in
8,Darsshan Kavedia,Grade 11,darsshan.kavedia@flag.org.in
9,Devanshi Rajgharia,Grade 11,devanshi.rajgharia@flag.org.in
10,Dhruv Jain,Grade 11,dhruv.jain@flag.org.in
11,Eisa Patel,Grade 11,eisa.patel@flag.org.in
12,Esha Khimawat,Grade 11,esha.khimawat@flag.org.in
13,Fatima Unwala,Grade 11,fatima.unwala@flag.org.in
14,Hamza Erfan,Grade 11,hamza.erfan@flag.org.in
15,Harsh Gosar,Grade 11,harsh.gosar@flag.org.in
As you can see, all of the names are sorted, but the roll number of Caroline Walker is 16. I want a way to sort only the roll numbers without affecting any of the other columns.
I want the final table to look like this:
roll_no,student_name,grade,email_id
1,Aarav Gosalia,Grade 11,aarav.gosalia@flag.org.in
2,Aarav Rawal,Grade 11,aarav.rawal@flag.org.in
3,Abizar Chitalwala,Grade 11,abizar.chitalwala@flag.org.in
4,Ahad Motorwala,Grade 11,ahad.motorwala@flag.org.in
5,Armaan Adenwala,Grade 11,armaan.adenwala@flag.org.in
6,Aryan Shah,Grade 11,aryan.shah@flag.org.in
7,Baasit Motorwala,Grade 11,baasit.motorwala@flag.org.in
8,Caroline Walker,Grade 11,caroline.walker@flag.org.in
9,Darsshan Kavedia,Grade 11,darsshan.kavedia@flag.org.in
10,Devanshi Rajgharia,Grade 11,devanshi.rajgharia@flag.org.in
11,Dhruv Jain,Grade 11,dhruv.jain@flag.org.in
12,Eisa Patel,Grade 11,eisa.patel@flag.org.in
13,Esha Khimawat,Grade 11,esha.khimawat@flag.org.in
14,Fatima Unwala,Grade 11,fatima.unwala@flag.org.in
15,Hamza Erfan,Grade 11,hamza.erfan@flag.org.in
16,Harsh Gosar,Grade 11,harsh.gosar@flag.org.in
Please help me, and keep in mind that I am still a beginner in Python.
Just use pandas: df['roll_no'] = range(1, len(df) + 1)
Pandas is one way:
import pandas as pd
df = pd.read_csv('file.csv')
# sort_values returns a new DataFrame, so assign it back
# (note: this reorders whole rows; to renumber only the roll_no column, see the next answer)
df = df.sort_values(by='roll_no')
df.to_csv('file.csv', index=False)
This will work:
df = pd.DataFrame({'id': [1, 3, 2, 7],
                   'name': ['M', 'r', 'd', 'd']})
# sort only the id values; the other columns keep their row order
df['id'] = list(df['id'].sort_values())
df
Result:
   id name
0   1    M
1   2    r
2   3    d
3   7    d
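Applying the same idea to the original file, a minimal sketch (assuming the CSV is named file.csv, as above):
import pandas as pd
df = pd.read_csv('file.csv')
# sort only the roll_no values; every other column keeps its row order
df['roll_no'] = sorted(df['roll_no'])
df.to_csv('file.csv', index=False)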

python pandas - function applied to csv is not persisted

I need to polish a CSV dataset, but it seems the changes are not applied to the dataset itself.
CSV is in this format:
ID, TRACK_LINK
761607, https://mylink.com//track/...
This is my script:
import pandas as pd
df = pd.read_csv('./file.csv').fillna('')
# remove double // from TRACK_LINK
def polish_track_link(track_link):
    return track_link.replace("//track", "/track")
df['TRACK_LINK'].apply(polish_track_link)
print(df)
This prints something like:
...
761607 https://mylink.com//track/...
Note the //track.
If I do print(df['TRACK_LINK'].apply(polish_track_link)) I get:
...
761607, https://mylink.com/track/...
So the function polish_track_link works, but it's not applied to the dataset. Any idea why?
You need to assign it back:
df['TRACK_LINK'] = df['TRACK_LINK'].apply(polish_track_link)
But it is better to use the pandas functions str.replace, or replace with regex=True, to replace substrings:
df['TRACK_LINK'] = df['TRACK_LINK'].str.replace("//track", "/track")
Or:
df['TRACK_LINK'] = df['TRACK_LINK'].replace("//track", "/track", regex=True)
print(df)
       ID                 TRACK_LINK
0  761607  https://mylink.com/track/
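To see why the assignment matters, a small sketch: apply (and str.replace) return a new Series and leave the original column untouched until you assign it back.
import pandas as pd
df = pd.DataFrame({'TRACK_LINK': ['https://mylink.com//track/abc']})
fixed = df['TRACK_LINK'].str.replace('//track', '/track')
print(df['TRACK_LINK'].iloc[0])  # still https://mylink.com//track/abc
print(fixed.iloc[0])             # https://mylink.com/track/abc
df['TRACK_LINK'] = fixed         # now the change is persisted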
