Hello, I have iterated through multiple CSV files and merged them, and it worked. But the column names in all the CSV files are in this order:
Output: id title content tags
However my code outputs the columns in this order:
Output: content id tags title
How do I get the columns back into the order that all the CSV files use?
Here is my code:
import glob
import os
import pandas as pd
pd.set_option("display.max_rows", 999)
pd.set_option('display.max_colwidth', 100)
import numpy as np
from IPython.display import display
%matplotlib inline
file_path = 'data/'
all_files = glob.glob(os.path.join(file_path, "*.csv"))
merging_csv_files = (pd.read_csv(f) for f in all_files)
stack_exchange_data = pd.concat(merging_csv_files, ignore_index=True)
print("Data loaded successfully!")
print("Stack Exchange Data has {} rows with {} columns each.".format(*stack_exchange_data.shape))
The general way to select a DataFrame's columns in a specific order is to create a list of the order you desire and pass that list to the DataFrame's bracket operator, like this:
my_col_order = ['id', 'title', 'content', 'tags']
df[my_col_order]
Also, you might want to check that all the DataFrames indeed have the same column order. I don't believe pandas will sort the column names in concat unless at least one DataFrame has a different column ordering. You might want to print out the column names from all the DataFrames you are concatenating.
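As a self-contained sketch of both steps, here are two small in-memory frames standing in for the CSV files (the data is made up), with the second one deliberately in a different column order:

```python
import pandas as pd

# Hypothetical frames standing in for the CSV files (made-up data);
# the second one deliberately has a different column order.
frames = [
    pd.DataFrame({"id": [1], "title": ["a"], "content": ["x"], "tags": ["t"]}),
    pd.DataFrame({"content": ["y"], "id": [2], "tags": ["u"], "title": ["b"]}),
]

# Inspect each frame's column order before concatenating.
for i, df in enumerate(frames):
    print(i, list(df.columns))

# Enforce the desired order after concat with the bracket operator.
my_col_order = ["id", "title", "content", "tags"]
stacked = pd.concat(frames, ignore_index=True)[my_col_order]
print(list(stacked.columns))  # ['id', 'title', 'content', 'tags']
```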
Related
In the code that I present, I read CSV files from one folder and write them to another. Each of these CSVs contains two columns, which were chosen when the DataFrame was defined. I need to count how many times the value in column f was above 50.025 and write that into some column.
CODE:
import pandas as pd
import numpy as np
import glob
import os
all_files = glob.glob("C:/Users/Gamer/Documents/Colbun/Saturn/*.csv")
file_list = []
for i, f in enumerate(all_files):
    df = pd.read_csv(f, header=0, usecols=["t", "f"])
    df.apply(lambda x: x['f'] > 50.025, axis=1)
    df.to_csv(f'C:/Users/Gamer/Documents/Colbun/Saturn2/{os.path.basename(f).split(".")[0]}_ext.csv')
It's not logical to store it in a column, since it is a summary of the entire table, not specific to any row.
df = pd.read_csv(f, header=0, usecols=["t", "f"])
how_many_times = len(df[df['f'] > 50.025])
# you may store it in a column, but it still doesn't make much sense
df['newcol'] = how_many_times
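As a runnable sketch of the counting step, with a toy frame standing in for one of the CSVs (the values are made up):

```python
import pandas as pd

# Toy frame standing in for one of the CSVs (made-up values).
df = pd.DataFrame({"t": [1, 2, 3], "f": [50.0, 50.03, 50.1]})

# Summing a boolean Series counts the True values, a slightly more
# direct equivalent of len(df[df['f'] > 50.025]).
how_many_times = (df["f"] > 50.025).sum()
print(how_many_times)  # 2
```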
To output the count of occurrences matching a particular filter and add it to every row of another column, you can simply do the following:
df['cnt'] = df[df['f'] > 50.025]['f'].count()
If you need to use that count in a further calculation, it would be better to store it in a variable and perform the calculation with that variable, rather than storing it in an entire column of your DataFrame.
In addition, I can see from the comments on your question that you also want to remove the index when outputting to CSV; to do that, add index=False to the df.to_csv() call.
Your code should look something like this:
import pandas as pd
import numpy as np
import glob
import os
all_files = glob.glob("C:/Users/Gamer/Documents/Colbun/Saturn/*.csv")
file_list = []
for i, f in enumerate(all_files):
    df = pd.read_csv(f, header=0, usecols=["t", "f"])
    df['cnt'] = df[df['f'] > 50.025]['f'].count()
    df.to_csv(f'C:/Users/Gamer/Documents/Colbun/Saturn2/{os.path.basename(f).split(".")[0]}_ext.csv', index=False)
I would like to write scalable code that imports multiple CSV files, standardizes the order of the columns based on the column names, and rewrites the CSV files.
import glob
import pandas as pd
# Get a list of all the csv files
csv_files = glob.glob('*.csv')
# List comprehension that loads all the files
dfs = [pd.read_csv(x,delimiter=";") for x in csv_files]
A=pd.DataFrame(dfs[0])
B=pd.DataFrame(dfs[1])
alpha=A.columns.values.tolist()
print([pd.DataFrame(x[alpha]) for x in dfs])
I would like to be able to split this object and write a CSV for each file, renaming them with the original names. Is that easily possible with Python? Thanks for your help.
If you want to reorder the columns consistently, assuming that all CSVs have the same column names but in a different order, you can sort one of the column-name lists and then order the other ones by that list. Using your example:
csv_files = glob.glob('*.csv')
sorted_columns = []
for e, x in enumerate(csv_files):
    df = pd.read_csv(x, delimiter=";")
    if e == 0:
        sorted_columns = sorted(df.columns.values.tolist())
    df[sorted_columns].to_csv(x, sep=";", index=False)
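The reordering step by itself can be sketched with in-memory frames instead of files (the data is made up):

```python
import pandas as pd

# Two frames standing in for CSV files with the same columns in
# different orders (made-up data).
frames = [
    pd.DataFrame({"b": [1], "a": [2]}),
    pd.DataFrame({"a": [3], "b": [4]}),
]

# Take the sorted column list from the first frame and apply it to all.
sorted_columns = sorted(frames[0].columns)
reordered = [df[sorted_columns] for df in frames]
for df in reordered:
    print(list(df.columns))  # ['a', 'b'] for both
```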
I have a best-practice question. Today I learned how to read and write files in pandas, how to create a table, and how to add and drop columns and rows.
I have an excel file with the following content:
I create a new column "Price_average" as the average of "Price_min" and "Price_max", and write it out as output_1.xlsx:
#!/usr/bin/env python3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import xlrd
df = pd.read_excel('original.xlsx')
print (df)
df['Price_average'] = (df.Price_min + df.Price_max)/2
df.to_excel('output_1.xlsx', sheet_name='sheet1', index=False)
print (df)
I then drop the columns "Price_min" and "Price_max" with:
df = df.drop(['Price_min', 'Price_max'], axis=1)
And let's say I want to create this table now:
Can I either delete "Age" and "Price_average" and swap "email" with "brand", or can I simply select the columns I want and create a new spreadsheet from them?
What's the best and cleanest way to do it: subtract the unwanted columns from the file, rearrange and (if wanted) rename the remaining ones, or pick and choose the needed columns and create a new file with them in the right order? Any suggestions?
You can try this:
selected = df[['Age', 'Price_average', 'Email', 'Brand']]
If you want to change column names:
renamed = selected.rename(columns={'Brand': 'brand', 'Email':'email'})
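Putting the pieces together, here is a minimal sketch with a small frame standing in for the spreadsheet (the column names are taken from the question; the values are made up):

```python
import pandas as pd

# A small frame standing in for the spreadsheet (made-up values).
df = pd.DataFrame({
    "Age": [30],
    "Price_min": [10.0],
    "Price_max": [20.0],
    "Email": ["a@b.c"],
    "Brand": ["Acme"],
})
df["Price_average"] = (df.Price_min + df.Price_max) / 2

# Pick only the wanted columns, in the wanted order, then rename.
out = df[["Age", "Price_average", "Email", "Brand"]].rename(
    columns={"Email": "email", "Brand": "brand"}
)
print(list(out.columns))  # ['Age', 'Price_average', 'email', 'brand']
# Writing the result would then be e.g.:
# out.to_excel("output_2.xlsx", index=False)
```

Selecting and renaming in one chain avoids mutating the original frame, which keeps the "pick and choose" approach clean.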
I am trying to import an xlsx file into a Python pandas DataFrame. I would like to prevent fields/columns from being interpreted as integers, and thus losing leading zeros or other desired heterogeneous formatting.
So for an Excel sheet with 100 columns, I would do the following, using a dict comprehension with range(100):
import pandas as pd
filename = r'C:\DemoFile.xlsx'
fields = {col: str for col in range(100)}
df = pd.read_excel(filename, sheet_name=0, converters=fields)
These import files do have a varying number of columns all the time, and I am looking to handle this differently than changing the range manually all the time.
Does somebody have any further suggestions or alternatives for reading Excel files into a dataframe and treating all fields as strings by default?
Many thanks!
Try this:
xl = pd.ExcelFile(r'C:\DemoFile.xlsx')
ncols = xl.book.sheet_by_index(0).ncols
df = xl.parse(0, converters={i : str for i in range(ncols)})
UPDATE:
In [261]: type(xl)
Out[261]: pandas.io.excel.ExcelFile
In [262]: type(xl.book)
Out[262]: xlrd.book.Book
Use dtype=str when calling .read_excel()
import pandas as pd
filename = r'C:\DemoFile.xlsx'
df = pd.read_excel(filename, dtype=str)
The usual solution is:
1. Read in one row of data just to get the column names and number of columns.
2. Create the dictionary automatically, where each column has a string type.
3. Re-read the full data using the dictionary created at step 2.
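The three steps above can be sketched in a self-contained way with a CSV buffer; with an .xlsx file the same pattern works through pd.read_excel, which also accepts nrows= and converters= (the sample data is made up):

```python
import io
import pandas as pd

# A small in-memory CSV standing in for the file (made-up data).
data = "code,qty\n007,5\n042,3\n"

# Step 1: read only the header row to learn the column names.
header = pd.read_csv(io.StringIO(data), nrows=0)

# Step 2: build the converters dict automatically, one str per column.
converters = {col: str for col in header.columns}

# Step 3: re-read the full data with those converters.
df = pd.read_csv(io.StringIO(data), converters=converters)
print(df["code"].tolist())  # leading zeros survive: ['007', '042']
```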
Hello, I have xlsx files and merged them into one DataFrame using pandas. It worked, but instead of getting back the column names I had in the xlsx files, I got numbers as columns, and the column titles became a row, like this:
Output: 1 2 3
COLTITLE1 COLTITLE2 COLTITLE3
When they should be like this:
Output: COLTITLE1 COLTITLE2 COLTITLE3
The column titles are not column titles any more; they have become a row. How can I get back the rightful column names that I had in the xlsx files? Just for clarity, the column names are the same in both xlsx files. Help would be appreciated; here's my code below:
# import modules
from IPython.display import display
import pandas as pd
import numpy as np
pd.set_option("display.max_rows", 999)
pd.set_option('display.max_colwidth', 100)
%matplotlib inline
# filenames
file_names = ["data/OrderReport.xlsx", "data/OrderReport2.xlsx"]
# read them in
excels = [pd.ExcelFile(name) for name in file_names]
# turn them into dataframes
frames = [x.parse(x.sheet_names[0], header=None,index_col=None) for x in excels]
# concatenate them
atlantic_data = pd.concat(frames)
# write it out
atlantic_data.to_excel("c.xlsx", header=False, index=False)
I hope I understood your question correctly. You need to get rid of header=None so that the first row of each sheet is used for the column names (index_col=None is already the default, so you can drop it too):
frames = [x.parse(x.sheet_names[0]) for x in excels]
With header=None, pandas treats your column-name row as a row of data and numbers the columns instead.
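To see the effect of the header argument, here is a minimal sketch using read_csv on an in-memory buffer instead of ExcelFile.parse; the header handling is the same (the sample data is made up):

```python
import io
import pandas as pd

# A tiny sheet standing in for one worksheet (made-up data).
data = "COLTITLE1,COLTITLE2\n1,2\n"

# header=None turns the title row into data and numbers the columns...
wrong = pd.read_csv(io.StringIO(data), header=None)
print(list(wrong.columns))  # [0, 1]

# ...while the default (header=0) keeps the titles as column names.
right = pd.read_csv(io.StringIO(data))
print(list(right.columns))  # ['COLTITLE1', 'COLTITLE2']
```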