I have the following rows in Excel:
How can I put them in ascending order in Python? (Notice how the row starting with 12 comes before the one starting with 118.)
I think the Pandas library would be a starting point? Any clue is appreciated.
Thanks.
First, read the Excel file:
import pandas as pd
df = pd.read_excel("your/file/path/file.xls")
df
data
0 1212.i.jpg
1 121.i.jpg
2 212.i.jpg
3 512.i.jpg
Then take a substring of the data, assuming the column name is "data".
The slice below drops the trailing 6 characters (".i.jpg"):
df["sub"] = df["data"].str[:-6]
Convert the new column to type int so that it sorts numerically rather than as text:
df["sub"] = df["sub"].astype(int)
Now sort the values by that new column:
df.sort_values("sub", inplace=True)
Finally, if you only want your original data:
df = df["data"]
1 121.i.jpg
2 212.i.jpg
3 512.i.jpg
0 1212.i.jpg
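The same steps can probably be folded into one call with the key argument of sort_values (available since pandas 1.1). This is only a sketch: the DataFrame below is rebuilt by hand from the output shown above instead of being read from Excel, and it assumes every value starts with an integer followed by a dot.

```python
import pandas as pd

# Hand-built stand-in for the Excel data shown above (an assumption)
df = pd.DataFrame({"data": ["1212.i.jpg", "121.i.jpg", "212.i.jpg", "512.i.jpg"]})

# Sort on the integer before the first dot; no helper column is created
sorted_df = df.sort_values(
    "data", key=lambda s: s.str.split(".").str[0].astype(int)
)
print(sorted_df)
```

The original index labels are kept, so the row order can be traced back to the spreadsheet.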
Using natsorted
from natsort import natsorted
df.data=natsorted(df.data)
df
Out[129]:
data
0 121.i.jpg
1 212.i.jpg
2 512.i.jpg
3 1212.i.jpg
Keep original data index
df.loc[natsorted(df.index,key=lambda x : df.data[x] )]
Out[138]:
data
1 121.i.jpg
2 212.i.jpg
3 512.i.jpg
0 1212.i.jpg
Or using argsort with split (this needs import numpy as np)
df.iloc[np.argsort(df.data.str.split('.').str[0].astype(int))]
Out[141]:
data
1 121.i.jpg
2 212.i.jpg
3 512.i.jpg
0 1212.i.jpg
Related
I have this dataframe and I need to drop all duplicates but I need to keep first AND last values
For example:
1 0
2 0
3 0
4 0
output:
1 0
4 0
I tried df.column.drop_duplicates(keep=("first","last")) but it doesn't work; it returns
ValueError: keep must be either "first", "last" or False
Does anyone know a workaround for this?
Thanks
You could use the pandas concat function to create a dataframe with both the first and last values.
pd.concat([
df['X'].drop_duplicates(keep='first'),
df['X'].drop_duplicates(keep='last'),
])
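You can't keep both first and last in a single drop_duplicates call, so the trick is to concat the DataFrames of first and last occurrences.
When you concat, you have to avoid duplicating rows that were never duplicates, because they appear in both results. So only concat the index labels of the second DataFrame that are not already in the first. (Not sure whether merge/join would work better.)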
import pandas as pd
d = {1:0,2:0,10:1, 3:0,4:0}
df = pd.DataFrame.from_dict(d, orient='index', columns=['cnt'])
print(df)
cnt
1 0
2 0
10 1
3 0
4 0
Then do this:
d1 = df.drop_duplicates(keep="first")
d2 = df.drop_duplicates(keep="last")
# pandas 2.0+ no longer accepts a set as an indexer, so use Index.difference
d3 = pd.concat([d1, d2.loc[d2.index.difference(d1.index)]])
d3
Out[60]:
cnt
1 0
10 1
4 0
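A possible alternative that avoids concat altogether is a boolean mask built from DataFrame.duplicated. This is a sketch, not part of the answer above: a row is dropped only when it is neither the first nor the last occurrence of its value. The name is_middle is made up for illustration.

```python
import pandas as pd

d = {1: 0, 2: 0, 10: 1, 3: 0, 4: 0}
df = pd.DataFrame.from_dict(d, orient="index", columns=["cnt"])

# duplicated(keep="first") flags every occurrence except the first,
# duplicated(keep="last") flags every occurrence except the last;
# rows flagged by both are strictly in the middle
is_middle = df.duplicated(keep="first") & df.duplicated(keep="last")

result = df[~is_middle]
print(result)
```

Because this only filters rows, the original row order is preserved and unique values are kept exactly once.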
Use a groupby on your column named column, then reindex. If you ever want to check for duplicate values in more than one column, you can extend the columns you include in your groupby.
df = pd.DataFrame({'column':[0,0,0,0]})
Input:
column
0 0
1 0
2 0
3 0
df.groupby('column', as_index=False).apply(lambda x: x if len(x)==1 else x.iloc[[0, -1]]).reset_index(level=0, drop=True)
Output:
column
0 0
3 0
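To illustrate the remark about grouping on more than one column, here is a hedged sketch that uses groupby with head(1) and tail(1) instead of apply. The column names a, b and val and the data are invented for the example, and the approach assumes the index labels are unique.

```python
import pandas as pd

# Hypothetical data: duplicates are judged on the pair (a, b)
df = pd.DataFrame({
    "a":   [0, 0, 0, 1, 1],
    "b":   [0, 0, 0, 0, 1],
    "val": [10, 11, 12, 13, 14],
})

g = df.groupby(["a", "b"])

# head(1) / tail(1) return the first / last row of each group,
# keeping the original index labels
ends = pd.concat([g.head(1), g.tail(1)])

# A single-row group appears in both pieces, so drop repeated labels
result = ends[~ends.index.duplicated()].sort_index()
print(result)
```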
I am trying to select the first 2 columns and the last 2 columns of a DataFrame by index with pandas and save the result back to the same DataFrame.
Is there a way to do that in one step?
You can use the iloc indexer to select the columns by position; listing -2 before -1 keeps the last two columns in their original order.
df.iloc[:,[0,1,-2,-1]]
You are looking for iloc:
df = pd.DataFrame([[1,2,3,4,5], [2,3,4,5,6], [3,4,5,6,7]], columns=['a','b','c','d','e'])
df.iloc[:,:2] # Grabs all rows and first 2 columns
df.iloc[:,-2:] # Grabs all rows and last 2 columns
pd.concat([df.iloc[:,:2],df.iloc[:,-2:]],axis=1) # Puts them together column-wise (side by side)
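If a single indexing step is preferred over concat, numpy's np.r_ helper can build the positional index from the two slices. This is a sketch under the assumption that the frame has at least four columns (with fewer, the two slices would overlap and columns would repeat).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3, 4, 5], [2, 3, 4, 5, 6], [3, 4, 5, 6, 7]],
    columns=["a", "b", "c", "d", "e"],
)

# np.r_[0:2, -2:0] expands to array([0, 1, -2, -1])
df = df.iloc[:, np.r_[0:2, -2:0]]
print(df)
```

Assigning the result back to df covers the "save it on the same dataframe" part of the question.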
df = pd.DataFrame([[1,2,3,4,5], [2,3,4,5,6], [3,4,5,6,7]], columns=['a','b','c','d','e'])
df[['a','b','d','e']]
result
a b d e
0 1 2 4 5
1 2 3 5 6
2 3 4 6 7
I have a dataframe, df, and a list of strings, cols_needed, which indicate the columns I want to retain in df. The column names in df do not exactly match the strings in cols_needed, so I cannot directly use something like intersection. But the column names do contain the strings in cols_needed. I tried playing around with str.contains but couldn't get it to work. How can I subset df based on cols_needed?
import pandas as pd
df = pd.DataFrame({
'sim-prod1': [1,2],
'sim-prod2': [3,4],
'sim-prod3': [5,6],
'sim_prod4': [7,8]
})
cols_needed = ['prod1', 'prod2']
# What I want to obtain:
sim-prod1 sim-prod2
0 1 3
1 2 4
With the regex option of filter
df.filter(regex='|'.join(cols_needed))
sim-prod1 sim-prod2
0 1 3
1 2 4
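One caveat worth sketching: the joined string is interpreted as a regular expression, so names containing characters such as "." or "+" may match more than intended. Escaping each name with re.escape is a cautious variant. The example reuses the question's data; the pattern and subset names are made up.

```python
import re
import pandas as pd

df = pd.DataFrame({
    'sim-prod1': [1, 2],
    'sim-prod2': [3, 4],
    'sim-prod3': [5, 6],
    'sim_prod4': [7, 8],
})
cols_needed = ['prod1', 'prod2']

# Escape each name so it is matched literally, then OR them together
pattern = '|'.join(re.escape(name) for name in cols_needed)
subset = df.filter(regex=pattern)
print(subset)
```

Note that this is still a substring match, so 'prod1' would also select a column named 'sim-prod10' if one existed.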
You can explore str.contains with a joint pattern, for example:
df.loc[:,df.columns.str.contains('|'.join(cols_needed))]
Output:
sim-prod1 sim-prod2
0 1 3
1 2 4
A list comprehension could work as well:
columns = [cols for cols in df
           for col in cols_needed
           if col in cols]
['sim-prod1', 'sim-prod2']
In [110]: df.loc[:, columns]
Out[110]:
sim-prod1 sim-prod2
0 1 3
1 2 4
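A small caution about the nested comprehension: if one column name contains more than one of the needed strings, it is listed once per match. A sketch using any() avoids that. The overlapping cols_needed below is a made-up case to show the difference.

```python
import pandas as pd

df = pd.DataFrame({
    'sim-prod1': [1, 2],
    'sim-prod2': [3, 4],
    'sim-prod3': [5, 6],
    'sim_prod4': [7, 8],
})
# Hypothetical overlapping search strings: both occur in 'sim-prod1'
cols_needed = ['prod1', 'sim-prod1']

# Nested loops: the column is emitted once for every string it contains
nested = [c for c in df for s in cols_needed if s in c]

# any(): each column is emitted at most once
columns = [c for c in df.columns if any(s in c for s in cols_needed)]

print(nested)
print(columns)
```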
I am trying to add additional index rows to an existing pandas dataframe after loading csv data into it.
So let's say I load my data like this:
columns = ['Relative_Pressure','Volume_STP']
df = pd.read_csv(StringIO(contents), skiprows=4, sep=r'\s+', index_col=False, header=None)
df.columns = columns
where contents is a string in csv format. The resulting DataFrame might look something like this:
For clarity reasons I would now like to add additional index rows to the DataFrame as shown here:
However in the link these multiple index rows are generated right when the DataFrame is created. I would like to add e.g. rows for unit or descr to the columns.
How could I do this?
You can create a MultiIndex on the columns by specifically creating the index and then assigning it to the columns separately from reading in the data.
I'll use the example from the link you provided. The first method is to create the MultiIndex when you make the dataframe:
df = pd.DataFrame({('A',1,'desc A'):[1,2,3],('B',2,'desc B'):[4,5,6]})
df.columns.names=['NAME','LENGTH','DESCRIPTION']
df
NAME A B
LENGTH 1 2
DESCRIPTION desc A desc B
0 1 4
1 2 5
2 3 6
As stated, this is not what you are after. Instead, you can make the dataframe (from your file for example) and then make the MultiIndex from a set of lists and then assign it to the columns:
df = pd.DataFrame({'desc A':[1,2,3], 'desc B':[4,5,6]})
# Output
desc A desc B
0 1 4
1 2 5
2 3 6
# Create a multiindex from lists
index = pd.MultiIndex.from_arrays((['A', 'B'], [1, 2], ['desc A', 'desc B']))
# Assign to the columns
df.columns = index
# Output
A B
1 2
desc A desc B
0 1 4
1 2 5
2 3 6
# Name the columns
df.columns.names = ['NAME','LENGTH','DESCRIPTION']
# Output
NAME A B
LENGTH 1 2
DESCRIPTION desc A desc B
0 1 4
1 2 5
2 3 6
There are other ways to construct a MultiIndex, for example, from_tuples and from_product. You can read more about Multi Indexes in the documentation.
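Applied to the question's setup, a minimal sketch with from_tuples might look like the following. The sample contents string, the unit labels "p/p0" and "cm3/g", and the level names are all assumptions made up for illustration; the question does not state them.

```python
from io import StringIO

import pandas as pd

# Made-up whitespace-separated sample standing in for the real file contents
contents = "0.01 10.5\n0.05 12.1\n0.10 13.4\n"

df = pd.read_csv(StringIO(contents), sep=r"\s+", header=None)

# Attach a second header row (hypothetical units) after loading
df.columns = pd.MultiIndex.from_tuples(
    [("Relative_Pressure", "p/p0"), ("Volume_STP", "cm3/g")],
    names=["name", "unit"],
)
print(df)
```

A column is then selected with the full tuple, for example df[("Volume_STP", "cm3/g")].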
I am attempting to drop a column from a DataFrame in pandas. The DataFrame is created from a text file.
import pandas as pd
df = pd.read_csv('sample.txt')
df.drop(['a'], 1, inplace=True)
However, this generates the following error:
ValueError: labels ['a'] not contained in axis
Here is a copy of the sample.txt file :
a,b,c,d,e
1,2,3,4,5
2,3,4,5,6
3,4,5,6,7
4,5,6,7,8
Thanks in advance.
So the issue is that your "sample.txt" file doesn't actually include the data you are trying to remove.
Your line
df.drop(['id'], 1, inplace=True)
is attempting to take your DataFrame (which holds the data from your sample file), find the column whose header is 'id' (axis 1 refers to the columns), and drop it in place. With inplace=True the existing object is modified and the call returns None, rather than returning a new object without that column.
The issue is that your sample data doesn't include a column with a header equal to 'id'.
In your current sample file, you can only do a drop where the label on axis 1 is 'a', 'b', 'c', 'd', or 'e'. Either correct your code to drop one of those labels or get a sample file with the correct header.
The documentation for Pandas isn't fantastic, but here is a good example of how to do a column drop in Pandas: http://chrisalbon.com/python/pandas_dropping_column_and_rows.html
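A quick way to check this before dropping is to print the column labels. The sketch below reads the sample from an in-memory string instead of sample.txt (an assumption, to keep it self-contained) and uses the columns= keyword, which states the intent without an axis number.

```python
from io import StringIO

import pandas as pd

# In-memory copy of the first rows of the sample file
sample = "a,b,c,d,e\n1,2,3,4,5\n2,3,4,5,6\n"
df = pd.read_csv(StringIO(sample))

# Inspect the labels that can actually be dropped
print(df.columns.tolist())

# Drop an existing column by label
df = df.drop(columns=["a"])

# errors="ignore" skips labels that are absent instead of raising
df = df.drop(columns=["id"], errors="ignore")
print(df)
```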
** Below added in response to Answer Comment from #saar
Here is my example code:
Sample.txt:
a,b,c,d,e
1,2,3,4,5
2,3,4,5,6
3,4,5,6,7
4,5,6,7,8
Sample Code:
import pandas as pd
df = pd.read_csv('sample.txt')
print('Current DataFrame:')
print(df)
df.drop(['a'], axis=1, inplace=True)  # axis must be passed by keyword in pandas 2.0+
print('\nModified DataFrame:')
print(df)
Output:
>>python panda_test.py
Current DataFrame:
a b c d e
0 1 2 3 4 5
1 2 3 4 5 6
2 3 4 5 6 7
3 4 5 6 7 8
Modified DataFrame:
b c d e
0 2 3 4 5
1 3 4 5 6
2 4 5 6 7
3 5 6 7 8
bad= pd.read_csv('bad_modified.csv')
A=bad.sample(n=10)
B=bad.drop(A.index,axis=0)
This is an example of dropping part of a DataFrame by row index: B is bad without the 10 randomly sampled rows in A.
In case you need it.
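A self-contained version of the same idea, as a sketch: the DataFrame here is generated in memory because bad_modified.csv is not available, and random_state is fixed only to make the sample repeatable.

```python
import pandas as pd

# Stand-in for the CSV data (hypothetical single column x)
bad = pd.DataFrame({"x": range(100)})

# Take 10 random rows, then drop exactly those rows by index label
A = bad.sample(n=10, random_state=0)
B = bad.drop(A.index, axis=0)

print(len(A), len(B))
```

This relies on the index labels being unique; with repeated labels, drop would remove every row sharing a sampled label.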