I've tried adjusting the pandas settings and setting the display width, to no avail. I just want VS Code to show me the full output. Some other posts mentioned a setting called 'Data Science', but I could not find that anywhere in VS Code.
This was a .py file running some other code that contains S3 bucket URLs, which is why I refrained from showing the original code.
Nevertheless, I can emulate the same behaviour with this simpler snippet in another file, which creates a DataFrame like so:
import pandas as pd

df_col1 = [12132432432423432, 32423432432534523, 34354354353454333, 44353453453453454, 53454353453453453]
df_col2 = ['test_url_thats_too_big_to_display_here' + str(i) for i in df_col1]
df = pd.DataFrame(list(zip(df_col1, df_col2)), columns=['a', 'b'])
print(df)
The above code creates two columns: one with an ID number, and the other with a URL string with the ID appended. Below is the output of the code.
There is no need for data casting or novel Series conversions; you can simply change pandas' display options.
Check the pandas options and settings documentation.
Here is a solution:
pd.get_option("display.max_colwidth")  # this is for your info only; the default is 50
pd.set_option("display.max_colwidth", None)  # after this you can print any column length

# try it out:
s = pd.Series([["a" * 150]])
print(s)
As per @wwii's suggestion, using
print(df.to_string())
works perfectly
Here is a sample output of the code with the last line changed to the above.
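For reference, a minimal sketch combining both suggestions on the toy DataFrame above (display.max_colwidth and display.width are standard pandas options):

import pandas as pd

pd.set_option("display.max_colwidth", None)  # never truncate cell contents
pd.set_option("display.width", None)         # let pandas auto-detect the terminal width

print(df)              # now prints untruncated columns
print(df.to_string())  # or render the whole frame explicitly as one string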
I'm trying to read a large, unfamiliar CSV file with pandas.
I came across some errors, so I added the following arguments:
df = pd.read_csv(csv_file, engine="python", error_bad_lines=False, warn_bad_lines=True)
It works well, skipping offending lines, and errors are printed to the terminal correctly, such as:
Skipping line 31175: field larger than field limit (131072)
However, I’d like to save all errors to a variable instead of printing them.
How can I do it?
Note that I have a big program here and can't change the output of all logs from file=sys.stdout to something else. I need a case-specific solution.
Thanks!
Use the on_bad_lines capability instead (available in pandas 1.4+):

badlines_list = []

def badlines_collect(bad_line: list[str]) -> None:
    badlines_list.append(bad_line)
    return None

df = pd.read_csv(csv_file, engine="python", on_bad_lines=badlines_collect)
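After the read, badlines_list holds one entry per skipped row, already split into fields by the parser; a quick hypothetical check of what was collected:

# each collected entry is the offending row as a list of parsed fields
for fields in badlines_list:
    print(len(fields), fields)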
I am trying to overwrite a value in a given cell using openpyxl. I have two sheets: one is called Raw, and it is populated by API calls; the second is called Data and is fed off the Raw sheet. The two sheets have exactly the same shape (columns/rows). I compare the two to see if there is a bay assignment in Raw. If there is, I copy it to the Data sheet. If the value in that column is missing in both Raw and Data, I run a complex algorithm (irrelevant for this question) to assign a bay number based on logic.
I am having problems with rewriting Excel using openpyxl.
Here's an example of my code:
import pandas as pd
from openpyxl import load_workbook

data_df = pd.read_excel('Algo Build v23test.xlsx', sheet_name='MondayData')
raw_df = pd.read_excel('Algo Build v23test.xlsx', sheet_name='MondayRaw')

# grab rows where there is no bay assignment in a specific column
no_bay_res = data_df[data_df['Bay assignment'].isnull()].reset_index()

book = load_workbook("Algo Build v23test.xlsx")
sheet = book["MondayData"]

for index, reservation in no_bay_res.iterrows():
    idx = int(reservation['index'])
    if pd.isna(raw_df.iloc[idx, 13]):
        continue
    else:
        value = raw_df.iat[idx, 13]
        data_df.iloc[idx, 13] = value
        sheet.cell(idx + 2, 14).value = int(value)

book.save("Algo Build v23test.xlsx")
book.close()
print(value)  # 302
Now the problem is that book.close() does not seem to work; the book object is still accessible in Python. It overwrites the Excel file just fine. However, if I try to run these two lines again:
data_df = pd.read_excel('Algo Build v23test.xlsx', sheet_name='MondayData')
raw_df = pd.read_excel('Algo Build v23test.xlsx', sheet_name='MondayRaw')
I get datasets full of NULL values, except for the value that was replaced (image attached).
However, if I open that Excel file manually from the folder, save it (Ctrl+S), and run the code again, it works properly. Weirdest problem.
I need to loop the code above for Monday through Sunday, so I need it to be able to read the data again without manually resaving the file.
For some reason, pandas will read all the formulas as NaN after the file has been used by openpyxl in the script, until the file has been opened, saved and closed again (openpyxl does not evaluate formulas, and saving through it drops the cached formula results that pandas reads, so the cells appear empty until Excel recalculates and saves them). Here's code that does that from within the script; however, it is rather slow.
import pandas as pd
import xlwings as xl

def df_from_excel(path, sheet_name):
    app = xl.App(visible=False)
    book = app.books.open(path)
    book.save()
    app.kill()
    return pd.read_excel(path, sheet_name)
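A hypothetical call for the workbook from the question (sheet names taken from the original code) would then be:

data_df = df_from_excel('Algo Build v23test.xlsx', 'MondayData')
raw_df = df_from_excel('Algo Build v23test.xlsx', 'MondayRaw')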
I got the same problem; the only workaround I found was to terminate excel.exe manually from Task Manager. After that, everything went fine.
I recently started diving into algorithmic trading and am building a bot for crypto trading.
For this I created a backtester with pandas to run different strategies with different parameters. The datasets (CSV files) I use are rather large (around 40 MB each).
They are processed fine, but as soon as I want to save the processed data to a CSV, nothing happens: no output whatsoever, not even an error message. I tried using the full path, I tried saving it with just the filename, and I even tried saving it as a .txt file. Nothing seems to work. I also tried the solutions I was able to find on Stack Overflow.
I am using Anaconda3 in case that could be the source of my problem.
Here you can find the part of my code which tries to save the dataframe to a file:
results_df = pd.DataFrame(results)
results_df.columns = ['strategy', 'number_of_trades', "capital"]
print(results_df)

for i in range(2, len(results_df)):
    if results_df.capital.iloc[i] < results_df.capital.iloc[0]:
        results_df.drop([i], axis="index")

        # results to csv
        current_dir = os.getcwd()
        results_df.to_csv(os.getcwd() + '\\file.csv')
        print(results_df)
Thank you for your help!
You can simplify your code a great deal and write it as follows (it should also run faster):
results_df = pd.DataFrame(results)
results_df.columns = ['strategy', 'number_of_trades', "capital"]
print(results_df)

first_row_capital = results_df.capital.iloc[0]
indexer_capital_smaller = results_df.capital < first_row_capital
values_to_delete = indexer_capital_smaller[indexer_capital_smaller].index
results_df.drop(index=values_to_delete, inplace=True)

# results to csv
current_dir = os.getcwd()
results_df.to_csv(os.getcwd() + '\\file.csv')
print(results_df)
I think the main problem in your code might be that you write the CSV inside the loop, each time you find an entry in the dataframe where capital satisfies the condition, so it gets written only if such a case is found.
And if you just do the deletion for the csv output but don't need the dataframe in memory anymore, you can make it even simpler:
results_df = pd.DataFrame(results)
results_df.columns = ['strategy', 'number_of_trades', "capital"]
print(results_df)

first_row_capital = results_df.capital.iloc[0]
indexer_capital_smaller = results_df.capital < first_row_capital

# results to csv: keep only the rows that the drop in the first variant would
# have retained, i.e. invert the "smaller than first row" mask
current_dir = os.getcwd()
results_df[~indexer_capital_smaller].to_csv(os.getcwd() + '\\file.csv')
print(results_df[~indexer_capital_smaller])
This second variant only applies the filter when writing and printing, without modifying the dataframe in memory.
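As a side note, the '\\' concatenation is Windows-specific; a sketch with os.path.join keeps the same behaviour and stays portable:

import os

out_path = os.path.join(os.getcwd(), "file.csv")  # portable equivalent of os.getcwd() + '\\file.csv'
results_df[~indexer_capital_smaller].to_csv(out_path)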
I have a file named "sample name_TIC.txt". The first three columns in this file are useful: Scan, Time, and TIC. It also has 456 useless columns after the first 3. To do other data processing, I need these unneeded columns to go away, so I wrote a bit of code to start:
os.chdir(main_folder)
mydir = os.getcwd()
nameslist = ['Scan', 'Time', 'TIC']

for path, subdirs, files in os.walk(mydir):
    for file in files:
        if file.endswith('TIC.txt'):
            myfile = os.path.join(path, file)
            TIC_df = pd.read_csv(myfile, sep="\t", skiprows=1, usecols=[0, 1, 2], names=nameslist)
Normally, the for loop sits inside a function that is iterated over a very large set of folders with a lot of samples, hence the os.walk stuff, but we can ignore that right now. This code will eventually be completed to save a new .txt file with only the 3 relevant columns.
The problem comes in the last line, the pd.read_csv call. It produces a dataframe whose index column comprises the data from the first 456 columns, while the last 3 columns of the .txt are given the names in nameslist and are callable as columns in pandas (i.e. using .iloc). This is not a MultiIndex; it is a single index containing all the data and whitespace of those first columns.
In this example code sep="\t" because that's how Excel can successfully import it. But I've also tried:
sep="\s"
delimiter=r"\s+" rather than a sep argument
including header=None
not including the usecols argument (edit: I made an error here and did not check the proper result of this change; this turned out to be the correct solution, see the edit below or the answer)
setting index_col=False
How can I get pd.read_csv to take the first 3 columns and ignore the rest?
Thanks.
EDIT: In my end-of-day foolishness, I made an error when changing the target df to the example TIC_df. In the original code this was taken from, it was named mz207_df, and my calling code was still referencing the old df name.
Changing the last line of code to:
TIC_df = pd.read_csv(myfile, sep="\s+", skiprows=1, usecols=[0, 1, 2], names=nameslist)
successfully resolved my problem. Using sep="\t" also worked. Sorry for wasting people's time. I will post this as an answer as well in case someone needs to learn about usecols like I did.
Answering here to make sure the problem gets flagged as answered, in case someone else searches for it.
I made an error when checking the result from the code which included the usecols=[0,1,2] argument; I was calling an older dataframe. The following line of code successfully generated the desired dataframe:
TIC_df = pd.read_csv(myfile,sep="\s+",skiprows=1, usecols=[0,1,2],names=nameslist)
Using sep="\t" also generated the correct dataframe, but I default to \s+ to accommodate different and variable formatting from analytical machine outputs.
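To finish the step the question mentions (saving a new .txt with only the three kept columns), a sketch along these lines should work; the output file name is made up for illustration:

# write the trimmed frame back out as a tab-separated .txt (hypothetical name)
out_file = myfile.replace('TIC.txt', 'TIC_trimmed.txt')
TIC_df.to_csv(out_file, sep="\t", index=False)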
I've got a really strange problem. I'm trying to read some data from an Excel file, but the nrows property has the wrong value: although my file has a lot of rows, it returns just 2.
I'm working with PyDev in Eclipse. I don't know what the problem actually is; everything looks fine.
When I try to access other rows by index manually, I get an IndexError.
I appreciate any help.
If it helps, here's my code:
import xlrd

def get_data_form_excel(address):
    wb = xlrd.open_workbook(address)
    profile_data_list = []
    for s in wb.sheets():
        for row in range(s.nrows):
            if row > 0:
                values = []
                for column in range(s.ncols):
                    values.append(str(s.cell(row, column).value))
                profile_data_list.append(values)
    print(str(profile_data_list))
    return profile_data_list
To make sure your file is not corrupt, try with another file; I doubt xlrd is buggy.
Also, I've cleaned up your code to look a bit nicer. For example the if row > 0 check is unneeded because you can just iterate over range(1, sheet.nrows) in the first place.
def get_data_form_excel(address):
    # this returns a generator, not a list; you can iterate over it as normal,
    # but if you need a list, convert the return value to one using list()
    for sheet in xlrd.open_workbook(address).sheets():
        for row in range(1, sheet.nrows):
            yield [str(sheet.cell(row, col).value) for col in range(sheet.ncols)]
or
def get_data_form_excel(address):
    # you can make this function also use a (lazily evaluated) generator instead
    # of a list by changing the brackets to normal parentheses
    return [
        [str(sheet.cell(row, col).value) for col in range(sheet.ncols)]
        for sheet in xlrd.open_workbook(address).sheets()
        for row in range(1, sheet.nrows)
    ]
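Hypothetical usage of either variant (the file name is made up); the generator version just needs to be materialised with list():

rows = list(get_data_form_excel("profiles.xls"))  # works for both variants
print(rows)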
After trying some other files, I'm sure it's about the file itself; I think it's related to differences between the Microsoft 2003 and 2007 formats.
I recently ran into this problem too: I was trying to read an Excel file and the row count given by xlrd's nrows was less than the actual one. As Zeinab Abbasi said, I tried other files and they worked fine.
Finally, I found the difference: there is a VB-script-based button embedded in the failing file, which is used to download and append records to the current sheet.
Then I tried to convert the file to .xlsx format, but it asked me to save it in another format with macros enabled, e.g. .xlsm. This time xlrd's nrows gave the correct value.
Is your Excel file using external data? I just had the same problem and found a fix. I was using Excel to get info from a Google Sheet, and I wanted Python to show me that data. So the fix for me was going to Data > Connections (in "Get External Data") > Properties and unchecking "Remove data from the external data range before saving the workbook".