I'm trying to divide some columns by a fixed number (1000), remove commas, and change the mixed type to int, all in the second-to-last code line. But apart from the listed columns, every other column is deleted after executing the code. How can I keep the other columns?
import os
import pandas as pd

df_1 = pd.read_excel(os.path.join(directory, 'copy.xlsm'), sheet_name="weekly", header=None)
df_1 = df_1.drop(df_1.columns[[0,1]], axis=1)
df_1.columns = df_1.loc[3].rename(None)
df_1 = df_1.drop(range(5))
columns = ["A", "B", "D", "G"]
df_1 = df_1.loc[:len(df_1) - 2, columns].replace(',', '', regex=True).apply(pd.to_numeric) / 1000
df_1.to_csv(os.path.join(directory, 'new.csv'), index=False, header=True)
Your problem is in this part:
df_1 = df_1.loc[...]...
You're overwriting the original df_1 with just a subset of its columns (and it seems you're losing some rows too) when you use the selector [:len(df_1) - 2, columns]. You only need to update the values of that selection:
df_1.loc[...] = df_1.loc[...]...
By using loc as the assignment target, you modify only those rows and columns, leaving the rest of the DataFrame intact.
Your code should contain this line instead:
df_1.loc[:len(df_1) - 2, columns] = df_1.loc[:len(df_1) - 2, columns].replace(',', '', regex=True).apply(pd.to_numeric) / 1000
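A minimal sketch of the difference, with made-up column names and values for illustration:

import pandas as pd

df = pd.DataFrame({"A": ["1,000", "2,000"], "B": ["3,500", "4,500"],
                   "keep_me": ["x", "y"]})

# Reassigning df to the selection would drop 'keep_me'.
# Assigning INTO the selection keeps every other column:
cols = ["A", "B"]
df.loc[:, cols] = df.loc[:, cols].replace(',', '', regex=True).apply(pd.to_numeric) / 1000
print(df)  # 'keep_me' is still present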
I have several large CSV files, each with 100 columns and 800k rows. Starting from the first column, every other column has cells that look like Python lists: for example, cell A2 holds [1000], cell A3 holds [2300], and so forth. Column 2 is fine and contains plain numbers, but columns 1, 3, 5, 7, ..., 99 are like column 1, with their values inside brackets. Is there an efficient way to remove the list brackets [] from those columns and turn their cells into plain numbers?
import os
import pandas as pd

files_directory = r"D:\my_files"
dir_files = os.listdir(files_directory)
for file in dir_files:
    edited_csv = pd.read_csv(os.path.join(files_directory, file))
    for column in list(edited_csv.columns):
        if (column % 2) != 0:
            edited_csv[column] = ?
Please try:
import pandas as pd

df = pd.read_csv('file.csv', header=None)
df.columns = df.iloc[0]  # promote the first row to the header
df = df[1:]
for col in df.columns[::2]:
    # strip the surrounding brackets and convert to a number
    df[col] = df[col].apply(lambda v: float(v[1:-1]))
print(df)
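For files this large, a vectorized string operation may be faster than the row-wise apply above; a sketch under the same assumption that the bracketed cells arrive as strings:

# alternative to the apply above: vectorized strip-and-cast
for col in df.columns[::2]:
    df[col] = df[col].str.strip('[]').astype(float)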
If the cells hold actual Python lists (rather than the strings that pd.read_csv produces), for example column_1[3] being [4554.8433], you can read the numerical value by indexing into the list:
value = column_1[3]
print(value[0])  # prints 4554.8433 instead of [4554.8433]
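Note that pd.read_csv hands you strings, not lists. If you genuinely need list objects, one hedged option is ast.literal_eval from the standard library:

import ast

# parse the string "[4554.8433]" into a real Python list, then unwrap the number
for col in df.columns[::2]:
    df[col] = df[col].apply(lambda s: ast.literal_eval(s)[0])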
I have a pandas df of 7000 rows × 7 columns, and a list (row_list) of values that I want to use to filter df.
What I want to do is select the rows from df whose values match an entry in the list.
This is what I got when I tried:
"Empty DataFrame
Columns: [A,B,C,D,E,F,G]
Index: []"
df = pd.read_csv('filename.csv')
df1 = pd.read_csv('filename1.csv', names=['A'])
row_list = []
for index, rows in df1.iterrows():
    my_list = [rows.A]
    row_list.append(my_list)
boolean_series = df.D.isin(row_list)
filtered_df = df[boolean_series]
print(filtered_df)
Replace
boolean_series = df.D.isin(row_list)
with
boolean_series = df.D.isin(df1.A)
Your row_list is a list of single-element lists, so no scalar in df.D can ever equal one of its entries; passing the Series df1.A compares against the values themselves. Let us know the result; if it doesn't work, show a sample of df and df1.A.
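A minimal runnable sketch of the fix, with made-up data:

import pandas as pd

df = pd.DataFrame({"D": [1, 2, 3, 4]})
df1 = pd.DataFrame({"A": [2, 4]})

boolean_series = df.D.isin(df1.A)  # compare against the values, not lists of values
print(df[boolean_series])          # keeps the rows where D is 2 or 4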
A few other options:
(1) generate separate dfs for each condition, concat, then dedupe (slow);
(2) write a custom function that annotates a bool column (default False, set True when the condition is fulfilled), then filter on that column;
(3) keep a list of the indices of all rows with your row_list values, then filter using iloc based on that indices list (sketched below).
Without an MRE, sample data, or a reason why your method didn't work, it's difficult to provide a more specific answer.
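A rough sketch of option (3), reusing the made-up df and df1 from the sketch above:

# collect the positional indices of matching rows, then take them with iloc
values = set(df1.A)
idx = [i for i, v in enumerate(df.D) if v in values]
filtered_df = df.iloc[idx]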
I have a dataframe whose first column contains lists of random size, from 0 to around 10 items each. The dataframe also contains several other columns of data.
I would like to insert as many columns as the length of the longest list, and then populate them sequentially so that each new column holds one item from the list in column one.
I was unsure of a good way to go about this.
sample = [[[0,2,3,7,8,9],2,3,4,5],[[1,2],2,3,4,5],[[1,3,4,5,6,7,8,9,0],2,3,4,5]]
headers = ["col1","col2","col3","col4","col5"]
df = pd.DataFrame(sample, columns = headers)
In this example I would like to add 9 columns after column 1, as this is the maximum length of a list in the dataframe (the one in the third row). These columns would be populated with:
0 2 3 7 8 9 NULL NULL NULL in the first row,
1 2 NULL NULL NULL NULL NULL NULL NULL in the second, etc...
Edit to fit OP's edit
This is how I would do it. First, pad the lists in the original column so they're all the same length and easier to work with. Then it's a matter of creating one column per position and filling it with the value at that position in the list:
import numpy as np
import pandas as pd

df = pd.DataFrame(sample, columns=headers)
df = df.rename(columns={'col1': 'col_of_lists'})
max_length = max(df['col_of_lists'].apply(len))
# pad every list with NaN up to the maximum length
df['col_of_lists'] = df['col_of_lists'].apply(lambda x: x + [np.nan] * (max_length - len(x)))
# one new column per list position
for i in range(max_length):
    df['col_' + str(i)] = df['col_of_lists'].apply(lambda x: x[i])
The easiest way to turn a series of lists into separate columns is to use apply to convert them into a Series, which triggers the 'expand' result type:
result = df['col1'].apply(pd.Series)
At this point, we can adjust the columns from the automatically numbered to include the name of the original 'col1', for example:
result.columns = ['col1_{}'.format(i + 1) for i in result.columns]
Finally, we can join it back to the original DataFrame. Using the fact that this was the first column makes it easy, just joining it to the left of the original frame, dropping the original 'col1' in the process:
result = result.join(df.drop('col1', axis=1))
You can even do it all as a one-liner, by using the rename() method to change column names:
df['col1'].apply(pd.Series).rename(
    lambda i: 'col1_{}'.format(i + 1),
    axis=1,
).join(df.drop('col1', axis=1))
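One aside worth hedging: apply(pd.Series) constructs a Series per row, which can be slow on large frames. Building the frame from the list column directly usually gives the same result much faster:

result = pd.DataFrame(df['col1'].tolist(), index=df.index)
result.columns = ['col1_{}'.format(i + 1) for i in result.columns]
result.join(df.drop('col1', axis=1))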
I have a CSV file that I process with pandas. I have four columns, as follows:
df.columns = ["id", "ocr", "raw_value", "manual_raw_value"]
However, I have some rows which have more than four columns. For instance:
id ocr raw_value manual_raw_value
2d704f42 OMNIPAGE remuneration rémunération hello
bfa6c9f14 OMNIPAGE 35470 35470
213e1e1e OMNIPAGE Echeance Echéance
I did the following in order not to read the rows with extra columns (like the first row):
df = pd.read_csv(filename, sep=",",index_col=None, error_bad_lines=False)
However, the rows with extra columns are kept.
Thank you
Another try. For easier indexing, I would rename the columns, even the unnecessary ones:
df.columns = range(0, df.shape[1])
I assume that the empty places are NaN, so valid rows will have NaN in every extra column. I was not able to find a single function for this, so I would iterate through the extra columns one by one, keep only the rows where they are null, and then pick only the needed columns:
for i in range(4, df.shape[1]):
    df = df[df.iloc[:, i].isnull()]
df = df[[0, 1, 2, 3]]
Then rename them how you want. Hope this will help.
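The loop can also be collapsed into a single boolean mask; a sketch under the same assumption that the extra fields come in as NaN (note that in recent pandas, error_bad_lines has been replaced by on_bad_lines='skip'):

# keep rows where every column beyond the fourth is NaN, then keep the first four
df = df[df.iloc[:, 4:].isnull().all(axis=1)]
df = df[[0, 1, 2, 3]]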
I am creating a dataframe from a CSV file. I have gone through the docs and multiple SO posts (I have only just started with pandas) but didn't find an answer. The CSV file has multiple columns with the same name, say a.
After forming the dataframe, when I do df['a'], which value will it return? It does not return all the values.
Also, only one of the duplicate columns will hold a string; the rest will be None. How can I get that particular column?
The relevant parameter is mangle_dupe_cols.
From the docs:
mangle_dupe_cols : boolean, default True
Duplicate columns will be specified as 'X.0'...'X.N', rather than 'X'...'X'
By default, your duplicate 'a' columns get mangled into unique names ('a', 'a.1', 'a.2', ... in modern pandas) so each can be addressed individually.
If you used mangle_dupe_cols=False, importing this CSV would raise an error, since that setting was not supported.
You can get all of your 'a' columns with:
df.filter(like='a')
Demonstration:
from io import StringIO  # Python 3; the original used Python 2's StringIO module
import pandas as pd
txt = """a, a, a, b, c, d
1, 2, 3, 4, 5, 6
7, 8, 9, 10, 11, 12"""
df = pd.read_csv(StringIO(txt), skipinitialspace=True)
df
df.filter(like='a')
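One caveat: filter(like='a') matches any column whose name merely contains 'a' (it would also pick up a column named 'label'). If that's a risk, a regex anchored to the mangled names is safer:

df.filter(regex=r'^a(\.\d+)?$')  # matches 'a', 'a.1', 'a.2', ...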
I had a similar issue, not due to reading from csv, but I had multiple df columns with the same name (in my case 'id'). I solved it by taking df.columns and resetting the column names using a list.
In : df.columns
Out:
Index(['success', 'created', 'id', 'errors', 'id'], dtype='object')
In : df.columns = ['success', 'created', 'id1', 'errors', 'id2']
In : df.columns
Out:
Index(['success', 'created', 'id1', 'errors', 'id2'], dtype='object')
From here, I was able to call 'id1' or 'id2' to get just the column I wanted.
That's what I usually do with my gene expression datasets, where the same gene name can occur more than once because of slightly different genetic sequences of the same gene:
First, create a list of the duplicated columns in the dataframe (i.e. the column names that appear more than once):
duplicated_columns_list = []
list_of_all_columns = list(df.columns)
for column in list_of_all_columns:
    if list_of_all_columns.count(column) > 1 and column not in duplicated_columns_list:
        duplicated_columns_list.append(column)
duplicated_columns_list
Then use the .index() function, which finds the first occurrence of a duplicated name on each call, to rename the occurrences with numeric suffixes:
for column in duplicated_columns_list:
    # .index() finds the first occurrence that has not yet been renamed
    list_of_all_columns[list_of_all_columns.index(column)] = column + '_1'
    list_of_all_columns[list_of_all_columns.index(column)] = column + '_2'
This loop suffixes all of the duplicated columns, so every column ends up with a distinct name.
This specific code assumes each name appears exactly twice, but it can be modified for columns that appear more than twice in your dataframe (see the generalized sketch below).
Finally, rename your columns with the suffixed names:
df.columns = list_of_all_columns
That's it, I hope it helps :)
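For names that occur more than twice, a generalized sketch (the suffix scheme here is just one choice):

import pandas as pd

cols = pd.Series(df.columns)
for dup in cols[cols.duplicated()].unique():
    # number every occurrence of this duplicated name: gene_1, gene_2, gene_3, ...
    n = (cols == dup).sum()
    cols[cols == dup] = ['{}_{}'.format(dup, i + 1) for i in range(n)]
df.columns = cols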
Similarly to JDenman6 (and related to your question), I had two df columns with the same name (named 'id').
Hence, calling
df['id']
returns 2 columns.
You can use
df.iloc[:, ind]
where ind is the positional index of the column in the df. You can find the indices using:
indices = [i for i,x in enumerate(df.columns) if x == 'id']
where you replace 'id' with the name of the column you are searching for.
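For example, if 'id' sits at positions 2 and 4 as in the Index shown earlier, indices would be [2, 4] and you can pick either column:

first_id = df.iloc[:, indices[0]]   # the first 'id' column
second_id = df.iloc[:, indices[1]]  # the second 'id' column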