Python Pandas drop

I built a script with Python and I use Pandas.
I'm trying to delete rows from a dataframe.
I want to delete rows that contain empty values in two specific columns.
If one of those two columns is filled in but not the other, the row is preserved.
So I have built this code, and it works. But I'm a beginner and I'm sure I can simplify it.
I'm sure I don't need a "for" loop in my function. I think there is a way to do it with the right method. I read the docs online but I found nothing.
I tried my best but I need help.
Also, for some reasons, I don't want to use NumPy.
So here is my code:
import pandas as pnd

def drop_empty_line(df):
    a = df[(df["B"].isna()) & (df["C"].isna())].index
    for i in a:
        df = df.drop([i])
    return df

def main():
    df = pnd.DataFrame({
        "A": [5, 0, 4, 6, 5],
        "B": [pnd.NA, 4, pnd.NA, pnd.NA, 5],
        "C": [pnd.NA, pnd.NA, 9, pnd.NA, 8],
        "D": [5, 3, 8, 5, 2],
        "E": [pnd.NA, 4, 2, 0, 3]
    })
    print(drop_empty_line(df))

if __name__ == '__main__':
    main()

You indeed don't need a loop. You don't even need a custom function; there is already dropna:
df = df.dropna(subset=['B', 'C'], how='all')
# or in place:
# df.dropna(subset=['B', 'C'], how='all', inplace=True)
output:
   A     B     C  D  E
1  0     4  <NA>  3  4
2  4  <NA>     9  8  2
4  5     5     8  2  3
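As a side note, the how parameter is what makes this match your rule: how='all' only drops a row when every column listed in subset is missing, while the default how='any' would also drop rows where just one of B or C is empty. A quick sketch of the difference:
# drops only rows 0 and 3, where both B and C are missing
df.dropna(subset=['B', 'C'], how='all')
# would additionally drop rows 1 and 2, where exactly one of B/C is missing
df.dropna(subset=['B', 'C'], how='any')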

How to fix error: 'AnnAssign' nodes are not implemented in Python

I'm trying to do the following:
import pandas as pd
d = {'col1': [1, 7, 3, 6], 'col2': [3, 4, 9, 1]}
df = pd.DataFrame(data=d)
out = df.query('col1 > col2')
out:
   col1  col2
1     7     4
3     6     1
This works OK. But when I change the column name col1 --> col1:suf,
d = {'col1:suf': [1, 7, 3, 6], 'col2': [3, 4, 9, 1]}
df = pd.DataFrame(data=d)
out = df.query('col1:suf > col2')
I get an error:
'AnnAssign' nodes are not implemented
Is there an easy way to avoid this behavior? Of course, renaming the headers etc. is a workaround.
The colon : is a special character in query expressions: query() parses its argument like a Python expression, so col1:suf looks like an annotated assignment (hence the 'AnnAssign' node in the error message). You need to enclose the column name in backticks.
Try this :
out = df.query('`col1:suf` > col2')
Output:
print(out)
  col1:suf  col2
1        7     4
3        6     1
According to ValentinFFM's comment on this issue, you need to put backtick quotes around your column name, like:
df.query('`Column: Name` == value')
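If you'd rather not quote the column name at all, plain boolean indexing sidesteps the query parser entirely; a minimal sketch of the same filter:
import pandas as pd

d = {'col1:suf': [1, 7, 3, 6], 'col2': [3, 4, 9, 1]}
df = pd.DataFrame(data=d)

# bracket access accepts any column name, no backticks needed
out = df[df['col1:suf'] > df['col2']]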

Proper way to do this in pandas without using a for loop

I would like to avoid iterrows here.
From my dataframe I want to create a new column "unique" based on this condition: the first time a given pair of "a" and "b" values appears, it gets a value "uniqueN", and every later occurrence of the exact same "a", "b" pair gets that same "uniqueN".
In this case:
"1", "3" (the first row) from "a" and "b" is the first unique pair, so I give it the value "unique1", and the seventh row will also have the same value, "unique1", as it is also "1", "3".
"2", "2" (the second row) is the next unique "a", "b" pair, so I give it "unique2", and the eighth row also has "2", "2", so that will also have "unique2".
"3", "1" (third row) is the next unique pair, so "unique3"; no other row in the df is "3", "1", so that value won't repeat.
and so on.
I have working code that uses loops, but this is not the pandas way. Can anyone suggest how I can do this using pandas functions?
Expected output (my code works, but it's not using pandas methods):
   a  b   unique
0  1  3  unique1
1  2  2  unique2
2  3  1  unique3
3  4  2  unique4
4  3  3  unique5
5  4  2  unique4
6  1  3  unique1
7  2  2  unique2
Code
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 3, 4, 1, 2], 'b': [3, 2, 1, 2, 3, 2, 3, 2]})

c = 1
seen = {}
for i, j in df.iterrows():
    j = tuple(j)
    if j not in seen:
        seen[j] = 'unique' + str(c)
        c += 1
for key, value in seen.items():
    df.loc[(df.a == key[0]) & (df.b == key[1]), 'unique'] = value
Let's use groupby ngroup with sort=False to ensure values are enumerated in order of appearance, add 1 so group numbers start at one, then convert to string with astype so we can add the prefix unique to the number:
df['unique'] = 'unique' + \
    df.groupby(['a', 'b'], sort=False).ngroup().add(1).astype(str)
Or with map and format instead of converting and concatenating:
df['unique'] = (
    df.groupby(['a', 'b'], sort=False).ngroup()
    .add(1)
    .map('unique{}'.format)
)
df:
   a  b   unique
0  1  3  unique1
1  2  2  unique2
2  3  1  unique3
3  4  2  unique4
4  3  3  unique5
5  4  2  unique4
6  1  3  unique1
7  2  2  unique2
Setup:
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3, 4, 3, 4, 1, 2], 'b': [3, 2, 1, 2, 3, 2, 3, 2]
})
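To see why this works, it helps to print the intermediate result: ngroup numbers each (a, b) group, and sort=False keeps those numbers in order of first appearance rather than sorted key order:
print(df.groupby(['a', 'b'], sort=False).ngroup())
# 0    0
# 1    1
# 2    2
# 3    3
# 4    4
# 5    3
# 6    0
# 7    1
# dtype: int64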
I came up with a slightly different solution. I'll add this for posterity, but the groupby answer is superior.
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 3, 4, 1, 2], 'b': [3, 2, 1, 2, 3, 2, 3, 2]})

# keep the first occurrence of each (a, b) pair; copy() avoids a SettingWithCopyWarning
df1 = df[~df.duplicated()].copy()

# label the unique pairs in order of appearance
df1['unique'] = ['unique' + str(n) for n in range(1, len(df1) + 1)]

# merge the labels back onto every row of the original frame
df2 = df.merge(df1, how='left')
print(df2)

Is there a way to export to a CSV file by specifying rows in Pandas DataFrame?

New coder here. I'm sort of aware of how to export columns from a dataframe into a CSV file, but would like to know how to do the same sort of thing with rows. Below is an example of what I tried:
from pandas import DataFrame
x = [1, 2, 3, 4]
y = [7, 8, 9, 10]
dataSet = {"X": x, "Y": y}
df = DataFrame(dataSet, rows=["X", "Y"])
df.to_csv("rowstest.csv")
I would like the csv file to look like this:
X, 1, 2, 3, 4
Y, 7, 8, 9, 10
Is there a way I can do this?
I appreciate any and all help!
Use DataFrame.from_dict first, and then skip writing the default column names in DataFrame.to_csv via the header=False parameter:
import pandas as pd

x = [1, 2, 3, 4]
y = [7, 8, 9, 10]
dataSet = {"X": x, "Y": y}
df = pd.DataFrame.from_dict(dataSet, orient='index')
print(df)
   0  1  2   3
X  1  2  3   4
Y  7  8  9  10

df.to_csv("rowstest.csv", header=False)
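The resulting rowstest.csv then contains exactly the rows you asked for (just without spaces after the commas):
X,1,2,3,4
Y,7,8,9,10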
df.T.to_csv(...)
will transpose your columns to rows and rows to columns, which I think gives you the output you want.
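Spelled out, that transpose approach might look like this (a minimal sketch; header=False again suppresses the default 0..3 column labels):
import pandas as pd

x = [1, 2, 3, 4]
y = [7, 8, 9, 10]
df = pd.DataFrame({"X": x, "Y": y})

# transpose so the columns X and Y become rows, then skip the header row
df.T.to_csv("rowstest.csv", header=False)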

Concatenate in place in a sub-function with the pandas concat function?

I'm trying to write a function that takes a pandas DataFrame as an argument and, at some point, concatenates this dataframe with another.
For example:
def concat(df):
    df = pd.concat((df, pd.DataFrame({'E': [1, 1, 1]})), axis=1)
I would like this function to modify the input df in place, but I can't find how to achieve this. When I do
...
print(df)
concat(df)
print(df)
the dataframe df is identical before and after the function call.
Note: I don't want to do df['E'] = [1, 1, 1] because I don't know how many columns will be added to df. So I want to use pd.concat(), if possible...
The following will edit the original DataFrame in place and give the desired output, as long as the new data has the same number of rows as the original and there are no conflicting column names.
It's the same idea as your df['E'] = [1, 1, 1] suggestion, except it works for an arbitrary number of columns.
I don't think there is a way to achieve this using pd.concat, as it doesn't have an inplace parameter the way some pandas functions do.
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'C': [10, 20, 30], 'D': [40, 50, 60]})
df[df2.columns] = df2
Results (df):
   A  B   C   D
0  1  4  10  40
1  2  5  20  50
2  3  6  30  60
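Wrapped in a function, the same assignment mutates the object the caller passed in, which is what the question is after. A minimal sketch (concat_inplace is a made-up name):
import pandas as pd

def concat_inplace(df, new_cols):
    # column assignment mutates the caller's DataFrame, unlike
    # pd.concat, which builds and returns a brand-new object
    df[new_cols.columns] = new_cols

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
concat_inplace(df, pd.DataFrame({'E': [1, 1, 1], 'F': [2, 2, 2]}))
print(df)  # df now also has columns E and F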

REPL-pastable DataFrame representation

Is there a way to get a textual representation of a dataframe that I can just paste back into the repl, but that still looks good as a table? Numpy repr manages this pretty well, I'm talking something like:
> df
   A  B  C
i
0  3  1  8
1  3  1  6
2  7  4  6
> df.to_python()
DataFrame(
    columns=['i', 'A', 'B', 'C'],
    data=[[0, 30, 1, 8],
          [1, 3, 1, 6],
          [2, 7, 4, 6]]
).set_index('i')
This seems like it would be especially useful for stack overflow, but I often find myself needing to share small dataframes and would love it if this were possible.
Edit: I know about to_csv and to_dict and so on, what I want is a way of exactly reproducing a dataframe that also can be read as a table. It seems that this probably doesn't have a current answer (although I'd love to see pandas add it), but I think I can make pd.read_clipboard('\s\s+') work for 95% of my usages.
StringIO tells Python to treat a string as a file-like object, which lets you feed it to the read_csv method. Example below...
import pandas as pd
from io import StringIO

s = """ A B C
i
0 3 1 8
1 3 1 6
2 7 4 6"""  # this is equivalent to str(df), i.e. what happens when you print df

# skiprows=[1] drops the index-name line 'i', which would otherwise be read
# as a data row; the first column is then used as the index automatically
df = pd.read_csv(StringIO(s), sep=r"\s+", engine='python', skiprows=[1])
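Assuming the skiprows tweak above, the round trip reproduces the table, though the index name i from the original display is lost:
print(df)
#    A  B  C
# 0  3  1  8
# 1  3  1  6
# 2  7  4  6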
df.to_dict() will get you close, although you do lose the index name:
df.to_dict()
Out[5]: {'A': {0: 30, 1: 3, 2: 7}, 'B': {0: 1, 1: 1, 2: 4}, 'C': {0: 8, 1: 6, 2: 6}}
df_copy = pd.DataFrame(df.to_dict())
df_copy
Out[7]:
    A  B  C
0  30  1  8
1   3  1  6
2   7  4  6
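If losing the index name matters, newer pandas (1.4+) added a 'tight' orient that round-trips it. A small sketch, assuming pandas >= 1.4:
import pandas as pd

df = pd.DataFrame({'A': [30, 3, 7], 'B': [1, 1, 4], 'C': [8, 6, 6]},
                  index=pd.Index([0, 1, 2], name='i'))

d = df.to_dict(orient='tight')                # keeps index values and names
df_copy = pd.DataFrame.from_dict(d, orient='tight')
print(df_copy.index.name)                     # i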
