Removing commented rows in place in pandas - python

I have a dataframe that may have comment lines at the bottom of it. For reasons outside the scope of this question, I cannot pass the comment character when reading the file itself. Here is an example of what I would have:
df = pd.read_csv(file,header=None)
df
0 1
0 132605 1
1 132750 2
2 # total: 100000
Is there a way to remove all rows that start with a comment character in-place -- that is, without having to re-load the data frame?

Using startswith
newdf = df[df.iloc[:, 0].str.startswith('#').ne(True)]  # .ne(True) also keeps rows where startswith returns NaN
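Since the question asks for in-place removal, a minimal variant of the same idea (a sketch, assuming column 0 holds strings as in the example) drops the matching labels instead of building a new frame:
# build a boolean mask of comment rows; fillna(False) guards against non-string cells
mask = df[0].str.startswith('#').fillna(False)
df.drop(df[mask].index, inplace=True)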

Dataframe:
>>> df
0 1
0 132605 1
1 132750 2
2 # total: 100000
3 foo bar
Dropping in-place:
>>> to_drop = df[0].str.startswith('#').where(lambda s: s).dropna().index
>>> df.drop(to_drop, inplace=True)
>>> df
0 1
0 132605 1
1 132750 2
3 foo bar
Assumptions: you want to find rows where the column labeled 0 starts with '#'. Otherwise, adjust accordingly.
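If the column mixes numbers and strings (which can happen when a trailing comment forces object dtype), casting to str first keeps the comparison safe. A sketch under that assumption:
# astype(str) makes startswith work even on mixed int/str columns
df = df[~df[0].astype(str).str.startswith('#')]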

Related

pandas combine multiple row into one, and update other columns [duplicate]

I have this dataframe and I need to drop all duplicates, but I need to keep the first AND last values.
For example:
1 0
2 0
3 0
4 0
output:
1 0
4 0
I tried df.column.drop_duplicates(keep=("first","last")) but it doesn't work; it returns
ValueError: keep must be either "first", "last" or False
Does anyone know a workaround for this?
Thanks
You could use pandas' concat function to create a dataframe with both the first and last values.
pd.concat([
    df['X'].drop_duplicates(keep='first'),
    df['X'].drop_duplicates(keep='last'),
])
You can't drop both first and last at once, so the trick is to concat DataFrames of the first and last occurrences.
When you concat, you have to avoid creating duplicates of the rows that were never duplicated, so only concat the indexes from the second DataFrame that are not already in the first. (Not sure if merge/join would work better?)
import pandas as pd
d = {1:0,2:0,10:1, 3:0,4:0}
df = pd.DataFrame.from_dict(d, orient='index', columns=['cnt'])
print(df)
cnt
1 0
2 0
10 1
3 0
4 0
Then do this:
d1 = df.drop_duplicates(keep='first')
d2 = df.drop_duplicates(keep='last')
d3 = pd.concat([d1, d2.loc[d2.index.difference(d1.index)]])  # newer pandas rejects set indexers, so use Index.difference
d3
Out[60]:
cnt
1 0
10 1
4 0
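A more compact alternative, not from the original answers but worth sketching: combine the two duplicated() masks directly, so a row survives exactly when it is the first or the last occurrence of its value.
import pandas as pd
d = {1: 0, 2: 0, 10: 1, 3: 0, 4: 0}
df = pd.DataFrame.from_dict(d, orient='index', columns=['cnt'])
# a row is dropped only if it is neither the first nor the last occurrence
keep = ~(df.duplicated(keep='first') & df.duplicated(keep='last'))
print(df[keep])  # rows 1, 10 and 4, same result as d3 above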
Use a groupby on your column (literally named 'column' here), then reindex. If you ever want to check for duplicate values in more than one column, you can extend the columns you include in your groupby.
df = pd.DataFrame({'column':[0,0,0,0]})
Input:
column
0 0
1 0
2 0
3 0
df.groupby('column', as_index=False).apply(lambda x: x if len(x)==1 else x.iloc[[0, -1]]).reset_index(level=0, drop=True)
Output:
column
0 0
3 0

Ascending order of Excel rows

I have the following rows in Excel:
How can I put them in ascending order in Python? (Notice how the row starting with 12 comes before the one starting with 118.)
I think the Pandas library would be a starting point? Any clue is appreciated.
Thanks.
First, read the Excel file:
df = pd.read_excel("your/file/path/file.xls")
df
data
0 1212.i.jpg
1 121.i.jpg
2 212.i.jpg
3 512.i.jpg
Then take a substring of the data, assuming the column name is "data" and the suffix is always ".i.jpg" (6 characters):
df["sub"] = df["data"].str[:-6]
Just in case, convert the new column to type int
df["sub"] = df["sub"].astype(int)
Now sort the values, by that new column
df.sort_values("sub", inplace=True)
Finally, if you only want your original data:
df = df["data"]
1 121.i.jpg
2 212.i.jpg
3 512.i.jpg
0 1212.i.jpg
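On pandas 1.1+, the same idea fits in one line via the key argument of sort_values, a sketch assuming the column is named "data" and the suffix is always 6 characters:
# compute the integer prefix on the fly and sort by it
df = df.sort_values("data", key=lambda s: s.str[:-6].astype(int))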
Using natsorted
from natsort import natsorted
df.data = natsorted(df.data)  # reassigns the sorted values positionally, so the original index labels stay in place
df
Out[129]:
data
0 121.i.jpg
1 212.i.jpg
2 512.i.jpg
3 1212.i.jpg
To keep the original index of each row:
df.loc[natsorted(df.index, key=lambda x: df.data[x])]
Out[138]:
data
1 121.i.jpg
2 212.i.jpg
3 512.i.jpg
0 1212.i.jpg
Or using argsort with split (this needs numpy):
import numpy as np
df.iloc[np.argsort(df.data.str.split('.').str[0].astype(int))]
Out[141]:
data
1 121.i.jpg
2 212.i.jpg
3 512.i.jpg
0 1212.i.jpg

Data cleaning: Remove 0 value from my dataset having a header and index_col

I have the dataset shown below.
What I would like to do is three things.
Step 1: AA to CC are index columns; however, I am happy to keep them in the dataset for future use.
Step 2: Count the 0 values in each row.
Step 3: If more than 20% of a row is 0 (which means more than 2 here, because DD to MM spans 10 columns), remove the row.
So I used a crude approach to achieve the three steps above.
df = pd.read_csv("dataset.csv", header=None)
df_bool = (df == "0")
print(df_bool.sum(axis=1))
Then I got the expected result, shown below.
0 0
1 0
2 1
3 0
4 1
5 8
6 1
7 0
So I removed row #5, as indicated below.
df2 = df.drop([5], axis=0)
print(df2)
This works well, even though it is not an elegant way to go.
However, if I import my dataset with header=0, this approach does not work at all.
df = pd.read_csv("dataset.csv", header=0)
0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 0
How come this happens?
Also, if I would like to write a code with loop, count and drop functions, what does the code look like?
You can just continue using boolean indexing.
First we calculate the number of columns and the number of zeroes per row:
n_columns = len(df.columns) # or df.shape[1]
zeroes = (df == "0").sum(axis=1)
We then select only the rows that have less than 20% zeroes.
proportion_zeroes = zeroes / n_columns
max_20 = proportion_zeroes < 0.20
df[max_20] # This will contain only rows that have less than 20 % zeroes
One liner:
df[((df == "0").sum(axis=1) / len(df.columns)) < 0.2]
It would have been great if you could have posted how the dataframe looks in pandas rather than a picture of an Excel file. However, constructing a dummy df:
df = pd.DataFrame({'index1': ['a','b','c'], 'index2': ['b','g','f'], 'index3': ['w','q','z'],
                   'Col1': [0,1,0], 'Col2': [1,1,0], 'Col3': [1,1,1], 'Col4': [2,2,0]})
Step 1: assigning the index can be done using the .set_index() method, as per below:
df.set_index(['index1','index2','index3'],inplace=True)
Instead of doing everything manually when it comes to filtering out rows, you can use the result of df_bool.sum(axis=1) directly when filtering the dataframe, as per below (the 0.6 threshold here just demonstrates which rows match):
df.loc[(df==0).sum(axis=1) / (df.shape[1])>0.6]
index1 index2 index3 Col1 Col2 Col3 Col4
c f z 0 0 1 0
Using that, you can drop those rows; assuming your 20% threshold, you would keep the remainder with
df = df.loc[(df==0).sum(axis=1) / (df.shape[1])<0.2]
When it comes to the header issue, it's a bit difficult to answer without seeing what the file or dataframe looks like.
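One plausible cause, offered as an assumption since the file isn't shown: with header=0, pandas may infer numeric dtypes for the data rows, so the string comparison df == "0" never matches anything. Counting both representations sidesteps the dtype question:
# count zeroes whether they were parsed as strings or as numbers
zeroes = ((df == "0") | (df == 0)).sum(axis=1)
print(zeroes)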

Grouping data in Python with pandas yields a blank first row

I have this nice pandas dataframe:
And I want to group it by the column "0" (which represents the year) and calculate the mean of the other columns for each year. I do such thing with this code:
df.groupby(0)[[2, 3, 4]].mean()
And that successfully calculates the mean of every column. The problem here is the empty row that appears on top:
That's just a display thing: the grouped column becomes the index, and this is simply how it is displayed. You will notice that even when you set pd.set_option('display.notebook_repr_html', False) you still get this line; it has no effect on operations on the grouped df:
In [30]:
df = pd.DataFrame({'a':np.random.randn(5), 'b':np.random.randn(5), 'c':np.arange(5)})
df
Out[30]:
a b c
0 0.766706 -0.575700 0
1 0.594797 -0.966856 1
2 1.852405 1.003855 2
3 -0.919870 -1.089215 3
4 -0.647769 -0.541440 4
In [31]:
df.groupby('c')[['a','b']].mean()
Out[31]:
a b
c
0 0.766706 -0.575700
1 0.594797 -0.966856
2 1.852405 1.003855
3 -0.919870 -1.089215
4 -0.647769 -0.541440
Technically speaking, it has assigned the name attribute:
In [32]:
df.groupby('c')[['a','b']].mean().index.name
Out[32]:
'c'
By default there will be no name if it has not been assigned:
In [34]:
print(df.index.name)
None
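If the layout still bothers you, you can keep the group keys as a regular column, either up front with as_index=False or afterwards with reset_index(). A quick sketch:
# keep 'c' as an ordinary column from the start...
df.groupby('c', as_index=False)[['a', 'b']].mean()
# ...or demote the index back to a column afterwards
df.groupby('c')[['a', 'b']].mean().reset_index()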

Fastest way to copy columns from one DataFrame to another using pandas?

I have a large DataFrame (a million+ records) that I'm using to store the core of my data (like a database), and a smaller DataFrame (1 to 2000 records) whose columns I combine into the large one at each time step of my program, which can run for several thousand time steps. Both DataFrames are indexed the same way, by an id column.
The code I'm using is:
df_large.loc[new_ids, core_cols] = df_small.loc[new_ids, core_cols]
Here core_cols is a list of about 10 fields that I'm copying over, and new_ids are the ids from the small DataFrame. This code works fine, but it is the slowest part of my code by a magnitude of three. I just wanted to know if there was a faster way to merge the data of the two DataFrames together.
I tried merging the data each time with the merge function, but the process took way too long; that is why I have gone to maintaining a larger DataFrame that I update, to improve the speed.
There is nothing inherently slow about using .loc to set with an alignable frame, though it does go through a bit of code to cover a lot of cases, so it's probably not ideal to have in a tight loop. FYI, this example is slightly different from the 2nd example.
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: from pandas import DataFrame
In [4]: df = DataFrame(1.,index=list('abcdefghij'),columns=[0,1,2])
In [5]: df
Out[5]:
0 1 2
a 1 1 1
b 1 1 1
c 1 1 1
d 1 1 1
e 1 1 1
f 1 1 1
g 1 1 1
h 1 1 1
i 1 1 1
j 1 1 1
[10 rows x 3 columns]
In [6]: df2 = DataFrame(0,index=list('afg'),columns=[1,2])
In [7]: df2
Out[7]:
1 2
a 0 0
f 0 0
g 0 0
[3 rows x 2 columns]
In [8]: df.loc[df2.index,df2.columns] = df2
In [9]: df
Out[9]:
0 1 2
a 1 0 0
b 1 1 1
c 1 1 1
d 1 1 1
e 1 1 1
f 1 0 0
g 1 0 0
h 1 1 1
i 1 1 1
j 1 1 1
[10 rows x 3 columns]
Here's an alternative. It may or may not fit your data pattern. If the updates (your small frame) are pretty much independent this would work (IOW you are not updating the big frame, then picking out a new sub-frame, then updating, etc. - if this is your pattern, then using .loc is about right).
Instead of updating the big frame, update the small frame with the columns from the big frame, e.g.:
In [10]: df = DataFrame(1.,index=list('abcdefghij'),columns=[0,1,2])
In [11]: df2 = DataFrame(0,index=list('afg'),columns=[1,2])
In [12]: needed_columns = df.columns.difference(df2.columns)
In [13]: df2[needed_columns] = df.reindex(index=df2.index,columns=needed_columns)
In [14]: df2
Out[14]:
1 2 0
a 0 0 1
f 0 0 1
g 0 0 1
[3 rows x 3 columns]
In [15]: df3 = DataFrame(0,index=list('cji'),columns=[1,2])
In [16]: needed_columns = df.columns.difference(df3.columns)
In [17]: df3[needed_columns] = df.reindex(index=df3.index,columns=needed_columns)
In [18]: df3
Out[18]:
1 2 0
c 0 0 1
j 0 0 1
i 0 0 1
[3 rows x 3 columns]
And concat everything together when you want (they are kept in a list in the meantime; or, per my comments below, these sub-frames could be moved to external storage when created, then read back before this concatenating step).
In [19]: pd.concat([df.reindex(index=df.index.difference(df2.index).difference(df3.index)), df2, df3]).reindex_like(df)
Out[19]:
0 1 2
a 1 0 0
b 1 1 1
c 1 0 0
d 1 1 1
e 1 1 1
f 1 0 0
g 1 0 0
h 1 1 1
i 1 0 0
j 1 0 0
[10 rows x 3 columns]
The beauty of this pattern is that it is easily extended to using an actual db (or much better an HDFStore), to actually store the 'database', then creating/updating sub-frames as needed, then writing out to a new store when finished.
I use this pattern all of the time, though with Panels actually.
I perform a computation on a sub-set of the data and write each result to a separate file.
Then at the end I read them all in and concat (in memory), and write out a gigantic new file. The concat step could be done all at once in memory, or, if it is truly a large task, iteratively.
I am able to use multiple processes to perform my computations AND write each individual Panel to a separate file, as they are all completely independent. The only dependent part is the concat.
This is essentially a map-reduce pattern.
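A bare-bones sketch of that pattern using pickle files as the external store; chunks and process below are placeholders for your own sub-frames and per-chunk computation:
import pandas as pd
paths = []
for i, chunk in enumerate(chunks):   # chunks: your independent sub-frames (placeholder)
    result = process(chunk)          # process: your per-chunk computation (placeholder)
    path = f"part_{i}.pkl"
    result.to_pickle(path)           # the "map" step: write each piece out
    paths.append(path)
# the "reduce" step: read everything back and concat once at the end
final = pd.concat(pd.read_pickle(p) for p in paths)
final.to_pickle("combined.pkl")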
Quickly: Copy columns a and b from the old df into a new df.
df1 = df[['a', 'b']].copy()  # .copy() guarantees an independent frame rather than a view
I've had to copy between large dataframes a fair bit. I'm using dataframes with realtime market data, which may not be what pandas is designed for, but this is my experience..
On my PC, copying a single datapoint with .at takes 15µs, with the df size making a negligible difference. .loc takes a minimum of 550µs and increases as the df gets larger: 3100µs to copy a single point from one 100000x2 df to another. .ix seems to be just barely faster than .loc.
For a single datapoint .at is very fast and is not impacted by the size of the dataframe, but it cannot handle ranges so loops are required, and as such the time scaling is linear. .loc and .ix on the other hand are (relatively) very slow for single datapoints, but they can handle ranges and scale up better than linearly. However, unlike .at they slow down significantly wrt dataframe size.
Therefore when I'm frequently copying small ranges between large dataframes, I tend to use .at with a for loop, and otherwise I use .ix with a range.
for new_id in new_ids:
    for core_col in core_cols:
        df_large.at[new_id, core_col] = df_small.at[new_id, core_col]
Of course, to do it properly I'd go with Jeff's solution above, but it's nice to have options.
Caveats of .at: it doesn't work with ranges, and it doesn't work if the dtype is datetime (and maybe others).
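One more option worth timing, a sketch rather than anything from the answers above: since both frames are indexed the same way, you can assign the raw values and skip the index alignment that .loc performs on the right-hand side:
# .to_numpy() hands over a plain array, so pandas skips aligning the right-hand side
df_large.loc[new_ids, core_cols] = df_small.loc[new_ids, core_cols].to_numpy()
The row and column order of the right-hand side must match the selection exactly, which holds here because both sides use the same new_ids and core_cols.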
