I have links in a column of a pandas data frame. Whenever I try to iterate through that column (links) and fetch some text data, the following happens.
Suppose df is the data frame:
import urllib.request

for i in df:
    for line in urllib.request.urlopen(i):
        decoded_line = line.decode("utf-8")
        print(decoded_line)
If I run the above code, it shows an error.
When I printed that column, I saw that the column elements (links) end with a bunch of dots (...).
After searching a little I did,
pd.options.display.max_colwidth = 100
And it worked fine.
But I am curious how changing the "display column width" resolves my issue.
As far as I understood, when I was working with pd.options.display.max_colwidth = 50, the i in the for loop was taking only a portion of each link, with a bunch of dots at the end (why? how does the display width change the values that i actually takes?), and now that I have changed the display column width to 100 with pd.options.display.max_colwidth = 100, it takes the whole link. But why?
And does pd.options.display.max_colwidth change only the displayed column width, or does it also affect the actual value?
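For what it's worth, a quick way to confirm that the option only affects printing, not the stored values (a minimal sketch with a made-up 'links' column, not taken from the question):

import pandas as pd

df = pd.DataFrame({'links': ['https://example.com/' + 'x' * 200]})

pd.options.display.max_colwidth = 50
print(df)                          # the printed value is cut off with '...'
print(len(df['links'].iloc[0]))    # 220 -- the stored string is untouched

pd.options.display.max_colwidth = 100
print(len(df['links'].iloc[0]))    # still 220 -- the option is display-only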
Please help
Thank you in advance.
Related
I am working with a single data frame in pandas. This error does not occur when I perform the following on a subset of that data frame (6 rows, with NaN in some), and it does exactly what I need: all the NaN values in the 'Season' column get filled in properly.
Before:
Code:
s = df.set_index('Description')['Season'].dropna()
df['Season'] = df['Season'].fillna(df['Description'].map(s))
After:
Great! This is what I want to happen, one column at a time. I will worry about the other columns later.
But when I try the same code on the entire data frame, which is over 5000 rows long, I get the error stated in the title, and I am unable to pinpoint it to a specific row or rows.
What I tried:
I removed all non-ASCII characters, plus these special characters: ', ", and #, from the strings in the 'Description' column, which sometimes run to 50 characters including the non-ASCII and the three specific characters I removed.
df['Description'] = df['Description'].str.encode('ascii', 'ignore').str.decode('ascii')
df['Description'] = df['Description'].str.replace('"', '')
df['Description'] = df['Description'].str.replace("'", "")
df['Description'] = df['Description'].str.replace('#', '')
But the above did not help, and I still get the error. Does anyone have additional troubleshooting tips, know what I am failing to look for, or, ideally, a solution?
The code for the subset DataFrame and the main DataFrame are isolated, so I am not mixing up df and s or using them interchangeably. I wish that were the problem.
Recall the subset data frame above where the code worked perfectly. Through a process of elimination I discovered that when the subset data frame has one extra row, for a total of 8 rows, the code still works as expected. But once the 9th row is added, I get the error, and I can't figure out why.
Then the code:
s = df.set_index('Description')['Season'].dropna()
df['Season'] = df['Season'].fillna(df['Description'].map(s))
And the data frame is updated as expected:
But when the 9th row is added, the code above does not work:
I discovered how to solve the problem by adding .drop_duplicates('Description'), thereby changing:
s = df.set_index('Description')['Season'].dropna()
to
s = df.drop_duplicates('Description').set_index('Description')['Season'].dropna()
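For context, a note not from the original post: .map() needs a lookup Series with a unique index, so duplicate 'Description' values that survive dropna() are the likely trigger for the error. A minimal sketch with made-up data:

import pandas as pd

df = pd.DataFrame({
    'Description': ['ep1', 'ep1', 'ep2'],   # duplicate description
    'Season': [1.0, 1.0, None],
})

s = df.set_index('Description')['Season'].dropna()
# df['Description'].map(s)   # typically raises "Reindexing only valid with
#                            # uniquely valued Index objects", since s has
#                            # 'ep1' twice in its index

s = df.drop_duplicates('Description').set_index('Description')['Season'].dropna()
df['Season'] = df['Season'].fillna(df['Description'].map(s))
print(df)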
I am trying to display a column in full (containing tweet texts) in Python, but even if I use pd.set_option('display.max_colwidth', None), the output comes back truncated. The code below lets me display at the maximum width, but what I want is to save the output to a file with the maximum column width. Any help is appreciated.
def display_text_max_col_width(df, width):
    with pd.option_context('display.max_colwidth', width):
        print(df)

display_text_max_col_width(df["col"], 1000)
(screenshot: tweets ending with an ellipsis)
Any text can be truncated by doing:
MAX = 80
print(text[:MAX])
So it's quite easy that way.
EDIT: From your comments. I could not see the picture you posted, but here is my suggestion.
If you have some long strings (let's take part of this question as an example):
txt1 = """ I am trying to display a column in full (containing tweet texts) in Python, but even if I use pd.set_option('display.max_colwidth', None), the output returns as truncated. The code below lets me display the maximum width, but what I want is to save the file as output with max column width. Any help is appreciated. """
txt2 = """Thanks for your response. Indeed, I am trying to get an untruncated column, which contains text, and save it as an output file (e.g., csv) as such. The problem is, ('display.max_colwidth', None) doesn't give me untruncated text in the related column. – Kamil Yilmaz 2 days ago"""
You are saying this truncates or won't work for you, but it does work for me:
>>> with pd.option_context('display.max_colwidth', None): print(pd.DataFrame([txt1,txt2])) ...
0
0 I am trying to display a column in full (containing tweet texts) in Python, but even if I use pd.set_option('display.max_colwidth', None), the output returns as truncated. The code below lets me display the maximum width, but what I want is to save the file as output with max column width. Any help is appreciated.
1 Tanks for your response. Indeed, I am trying to get an untruncated column, which contains text, and save it as an output file (e.g., csv) as such. The problem is, ('display.max_colwidth', None) doesn't give me untruncated text in the related column. – Kamil Yilmaz 2 days ago
>>>
However, I would not use this method of displaying anything. You either have to format it yourself by walking down the rows and formatting each cell, or contend with some wackiness from unfiltered control characters that the tweets may contain.
You can iterate over it quite easily with the first method I suggested:
>>> _ = [print(line[:60]) for line in df[0]]
I am trying to display a column in full (containing tweet te
Tanks for your response. Indeed, I am trying to get an untru
You can use tabulate to format the data frame if you don't want to iterate over the data.
>>> from tabulate import tabulate
>>> print(tabulate(df))
- ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
0 I am trying to display a column in full (containing tweet texts) in Python, but even if I use pd.set_option('display.max_colwidth', None), the output returns as truncated. The code below lets me display the maximum width, but what I want is to save the file as output with max column width. Any help is appreciated.
1 Tanks for your response. Indeed, I am trying to get an untruncated column, which contains text, and save it as an output file (e.g., csv) as such. The problem is, ('display.max_colwidth', None) doesn't give me untruncated text in the related column. – Kamil Yilmaz 2 days ago
- ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Or use a combination of both methods. But I'm not a fan of pandas' output, as it is more bothersome than it's worth.
print will write to a file if you pass it a file object:
>>> with open('/tmp/text', 'w') as fp:
... [print(line[:150], file=fp) for line in df[0]]
...
[None, None]
>>> [print(line) for line in open('/tmp/text').readlines() ]
I am trying to display a column in full (containing tweet texts) in Python, but even if I use pd.set_option('display.max_colwidth', None), the output
Tanks for your response. Indeed, I am trying to get an untruncated column, which contains text, and save it as an output file (e.g., csv) as such. The
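A side note not from the original answer: DataFrame.to_csv does not go through the display options, so values written that way are never truncated. A minimal sketch, reusing the df built above (the output path is arbitrary):

# Writes the full cell values; display.max_colwidth plays no role here.
df.to_csv('/tmp/tweets.csv', index=False)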
I am trying to search through a pandas dataframe row by row and see if 3 variables are in the name of the file. If they are, more variables are extracted from that same row. For instance, I am checking whether the concentration, substrate and number of droplets match the file name. If this condition is true, which will only happen once as there are no duplicates, I want to extract the frame rate and the time from that same row. Below is my code:
import pandas as pd

excel_var = 'Experiental Camera.xlsx'
workbook = pd.read_excel(excel_var, "PythonTable")
workbook.Concentration.astype(int, errors='raise')

for index, row in workbook.iterrows():
    if str(row['Concentration']) and str(row['substrate']) and str(-+row['droplets']) in path_ext:
        Actual_Frame_Rate = row['Actual Frame Rate']
        Acquired_Time = row['Acquisition time']
Attached is an example of what my spreadsheet looks like and what my path_ext is.
At the moment nothing is being saved to Actual_Frame_Rate and I don't know why. I have attached pictures to show that it should match. Is there anything wrong with my code, or is there a better way to go about this? Any help is much appreciated.
So I am unsure why this helped, but I fixed it by combining it all into one string and matching against that. I used the following code:
for index, row in workbook.iterrows():
    match = 'water(' + str(row['Concentration']) + '%)-' + str(row['substrate']) + str(-+row['droplets'])
    # str(row['Concentration']) and str(row['substrate']) and str(-+row['droplets'])
    if match in path_ext:
        Actual_Frame_Rate = row['Actual Frame Rate']
        Acquired_Time = row['Acquisition time']
This code now produces the correct answer, but I am still unsure why I can't use the other method.
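A likely explanation, offered here as an editorial note rather than from the thread: in the first version the condition parses as str(...) and str(...) and (str(...) in path_ext), so only the last value is actually tested against path_ext; the first two are merely checked for truthiness, and any non-empty string passes. A minimal sketch with made-up values:

# Illustrative values only; not from the real spreadsheet.
path_ext = 'water(10%)-glass-5'
conc, sub, drops = '99', 'glass', '-5'

# Parses as: conc and sub and (drops in path_ext) -> True,
# even though '99' never appears in path_ext.
print(conc and sub and drops in path_ext)               # True

# Testing every value against path_ext expresses the intent:
print(all(v in path_ext for v in (conc, sub, drops)))   # False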
I'm working on my first correlation analysis. I received the data in an Excel file, imported it as a DataFrame (I had to pivot it), and now I have a set of almost 3000 rows and 25000 columns. I can't choose a subset of it, as every column is important for this project, and I also don't know what information each column stores in order to choose the most interesting ones, because everything is encoded with integer numbers (it is a university project). It is like a big questionnaire, where every person has his/her own row and the answers to every question are stored in different columns.
I really need to solve this issue, because later I'll have to replace the many NaNs with the medians of the columns and then start the correlation analysis. I tried that part first and it didn't work because of the size, which is why I tried downcasting first.
The dataset is 600 MB. I used the downcasting instruction for the floats and saved 300 MB, but when I try to put the new columns back into a copy of my dataset, it runs for 30 minutes and doesn't do anything. No warning, no error until I interrupt the kernel, and it still gives me no hint as to why it doesn't work.
I can't simply drop the NaNs first, because there are so many that it would erase almost everything.
# I've got this code from https://www.dataquest.io/blog/pandas-big-data/
def mem_usage(pandas_obj):
    if isinstance(pandas_obj, pd.DataFrame):
        usage_b = pandas_obj.memory_usage(deep=True).sum()
    else:  # we assume if not a df it's a series
        usage_b = pandas_obj.memory_usage(deep=True)
    usage_mb = usage_b / 1024 ** 2  # convert bytes to megabytes
    return "{:03.2f} MB".format(usage_mb)
gl_float = myset.select_dtypes(include=['float'])
converted_float = gl_float.apply(pd.to_numeric,downcast='float')
print(mem_usage(gl_float)) #almost 600
print(mem_usage(converted_float)) #almost 300
optimized_gl = myset.copy()
optimized_gl[converted_float.columns]= converted_float #this doesn't end
After the replacement works, I want to use the Imputer function for the NaN replacement and print the correlation result for my dataset.
In the end I decided to use this:
column1 = myset.iloc[:,0]
converted_float.insert(loc=0, column='ids', value=column1)
instead of the lines with optimized_gl, and it solved the problem. That was only possible because every column changed except the first one, so I just had to add the first column back to the others.
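For the NaN step mentioned above, a minimal sketch of median imputation with plain pandas, shown on a tiny stand-in frame (the real data is about 3000 rows by 25000 columns):

import numpy as np
import pandas as pd

# Tiny stand-in for the questionnaire data.
myset = pd.DataFrame({'q1': [1.0, np.nan, 3.0], 'q2': [np.nan, 2.0, 2.0]})

# Replace each column's NaNs with that column's median.
filled = myset.fillna(myset.median(numeric_only=True))

# Correlation matrix of the imputed data.
print(filled.corr())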
I'm trying to make nicely formatted tables from pandas. Some of my column names are far too long, and the cells for those columns are so wide that they make the whole table a mess.
In my example, is it possible to rotate the column names as they are displayed?
data = [{'Way too long of a column to be reasonable':4,'Four?':4},
{'Way too long of a column to be reasonable':5,'Four?':5}]
pd.DataFrame(data)
Something like:
data = [{'Way too long of a column to be reasonable':4,'Four?':4},
{'Way too long of a column to be reasonable':5,'Four?':5}]
dfoo = pd.DataFrame(data)
dfoo.style.set_table_styles(
    [dict(selector="th", props=[('max-width', '80px')]),
     dict(selector="th.col_heading",
          props=[("writing-mode", "vertical-rl"),
                 ('transform', 'rotateZ(-90deg)'),
                 ])]
)
is probably close to what you want:
(screenshot of the rendered result)
Looking at the pybloqs source code for the accepted answer's solution, I was able to find out how to rotate the columns without installing pybloqs. Note that this also rotates the index, but I have added code to remove those.
import pandas as pd
from IPython.display import HTML, display

data = [{'Way too long of a column to be reasonable': 4, 'Four?': 4},
        {'Way too long of a column to be reasonable': 5, 'Four?': 5}]
df = pd.DataFrame(data)

styles = [
    dict(selector="th", props=[("font-size", "125%"),
                               ("text-align", "center"),
                               ("transform", "translate(0%,-30%) rotate(-5deg)")
                               ]),
    dict(selector=".row_heading, .blank", props=[('display', 'none;')])
]

# Note: newer pandas versions offer Styler.to_html() in place of .render().
html = df.style.set_table_styles(styles).render()
display(HTML(html))
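If the goal is to keep the rotated headers outside the notebook, the generated HTML string can simply be written to a file (a small sketch; the file name is arbitrary):

with open('rotated_headers.html', 'w') as fh:
    fh.write(html)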
I placed @Bobain's nice answer into a function so I can re-use it throughout a notebook.
import pandas

def format_vertical_headers(df):
    """Display a dataframe with vertical column headers"""
    styles = [dict(selector="th", props=[('width', '40px')]),
              dict(selector="th.col_heading",
                   props=[("writing-mode", "vertical-rl"),
                          ('transform', 'rotateZ(180deg)'),
                          ('height', '290px'),
                          ('vertical-align', 'top')])]
    return (df.fillna('').style.set_table_styles(styles))

format_vertical_headers(pandas.DataFrame(data))
Using the Python library pybloqs (http://pybloqs.readthedocs.io/en/latest/), it is possible to rotate the column names as well as add padding at the top. The only downside (as the documentation mentions) is that the top padding does not work inline in Jupyter; the table must be exported.
import pandas as pd
from pybloqs import Block
import pybloqs.block.table_formatters as tf
from IPython.core.display import display, HTML

data = [{'Way too long of a column to be reasonable': 4, 'Four?': 4},
        {'Way too long of a column to be reasonable': 5, 'Four?': 5}]
dfoo = pd.DataFrame(data)

fmt_header = tf.FmtHeader(fixed_width='5cm', index_width='10%',
                          top_padding='10cm', rotate_deg=60)
table_block = Block(dfoo, formatters=[fmt_header])

display(HTML(table_block.render_html()))
table_block.save('Table.html')
Regarding the comment "I can get it so that the text is completely turned around 90 degrees, but can't figure out how to use text-orientation: upright as it just makes the text invisible": you were missing the writing-mode property, which has to be set for text-orientation to have any effect. I also made it apply only to column headings by modifying the selector a little.
dfoo.style.set_table_styles([dict(selector="th.col_heading",props=[("writing-mode", "vertical-lr"),('text-orientation', 'upright')])])
Hopefully this gets you a little closer to your goal!