Ok, so I have aggregated a bunch of data that looks like this:
X-mean   y-Mean    z-Mean
1 0.3444 2.34987 1.347
2 etc.
3
4
5
6
Except, it is not three columns, but 561 of them :-)
So, it seems like such a simple problem to me: I know how to plot a single column against the index using Mean_f_values.plot(y=y_vals, use_index=True). The column names are often a bunch of gibberish, so I want to produce individual plots by referring to columns by their location, not their names. I want to run some kind of for loop and display several graphs as I try to weed out useless columns. But all I can find (so far) says that we can only refer to columns by name, not by location, when plotting. It seems obvious to me that this cannot be true, at least with some simple plotting method. I am kind of a noob, so what am I missing? Thanks!
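A minimal sketch of plotting by column position instead of name, using iloc (this small frame is a stand-in for Mean_f_values; the names and values are made up):

import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for the aggregated frame from the question.
Mean_f_values = pd.DataFrame({
    'x-Mean': [0.3444, 0.40, 0.52],
    'y-Mean': [2.34987, 2.10, 2.25],
    'z-Mean': [1.347, 1.30, 1.41],
})

# .iloc selects columns by integer position, so the gibberish names never matter.
for i in range(Mean_f_values.shape[1]):
    Mean_f_values.iloc[:, i].plot(use_index=True, title=f'column {i}')
    plt.show()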
I've been struggling with this one a bit and am feeling a bit stuck.
I have a dataframe consisting of data like this, named merged_frame (it is a single frame, created by concatenating a handful of frames with the same shape):
fqdn source
0 site1.org public_source_a
1 site2.org public_source_a
2 site3.org public_source_a
3 site1.org public_source_b
4 site4.org public_source_b
5 site1.org public_source_b
6 site4.org public_source_d
7 site1.org public_source_c
...
What I am trying to do is create a new column in this frame that contains a list (ideally a Python list, as opposed to a comma-delimited string) of the sources when grouping by the fqdn value. For example, the data produced for the fqdn value site1.org should look like this based on the example data (this is just a subset of what I would expect; there should also be rows for the other fqdn values):
fqdn source_list source
site1.org [public_source_a, public_source_b, public_source_c] public_source_a
site1.org [public_source_a, public_source_b, public_source_c] public_source_b
site1.org [public_source_a, public_source_b, public_source_c] public_source_b
site1.org [public_source_a, public_source_b, public_source_c] public_source_c
Once I have the data in this form, I will simply drop the source column and then use drop_duplicates(keep='first') to get rid of all but one row per fqdn.
I dug up some old code that I used to do something similar about 2 years ago and it is not working as I expected it to. It's been quite a while since I've done something like this with Pandas. What I had was along the lines of:
merged_frame['source_list'] = merged_frame.groupby(
    'fqdn', as_index=False)[['source']].aggregate(
        lambda x: list(x))['source']
This is behaving very strangely. While it is in fact creating source_list as a list/array, the data in the column is not correct. Additionally, quite a few fqdn values end up with a null/NaN value for source_list.
I have a feeling that I need to approach this completely differently. A little help would be appreciated; I'm completely blocked and am not making any progress, despite having what I thought were very relevant example blocks of code I used on a similar dataset.
EDIT:
I have made a little progress by just starting with the fundamentals and have the following, though this joins the strings together rather than making them a list:
merged_frame['source_list'] = merged_frame.groupby('fqdn').source.transform(','.join)
I'm pretty sure that with a simple apply here I can split them back into a list. But what would be the correct way to do this in one shot, so that I don't need the unnecessary join followed by apply(split(','))?
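One way to do it in a single shot (a sketch, assuming merged_frame as above) is to aggregate each group to a list of unique sources once, then map that back onto the fqdn column, avoiding the join/split round trip:

# Build one list of unique sources per fqdn, then map it back row by row.
source_lists = merged_frame.groupby('fqdn')['source'].agg(lambda s: list(s.unique()))
merged_frame['source_list'] = merged_frame['fqdn'].map(source_lists)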
Create the data frame from the example above:
df = pd.DataFrame({
    'fqdn': ['site1.org', 'site2.org', 'site3.org', 'site1.org',
             'site4.org', 'site1.org', 'site4.org', 'site1.org'],
    'source': ['public_source_a', 'public_source_a', 'public_source_a',
               'public_source_b', 'public_source_b', 'public_source_b',
               'public_source_d', 'public_source_c']
})
Use groupby with unique() and apply(list):
df_grouped = df.groupby('fqdn')['source'].unique().apply(list).reset_index()
Merge with the original df and rename the columns:
result = pd.merge(df, df_grouped, on='fqdn', how='left')
result.rename(columns={'source_x': 'source', 'source_y': 'source_list'}, inplace=True)
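To finish the plan from the question (a sketch): drop the raw source column, then deduplicate. Note that drop_duplicates over all columns would fail here because lists are unhashable, so restrict it to fqdn with subset:

# Lists are unhashable, so deduplicate on the fqdn column only.
final = result.drop(columns='source').drop_duplicates(subset='fqdn', keep='first')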
I am using Python 3.7
I need to load data from two different sources (both csv) and determine which rows from the one source are not in the second.
I have used pandas dataframes to load the data and do a comparison between the two sources of data.
I loaded the data from the csv file, and a value like 2010392 is turned into 2010392.0 in the dataframe column.
I have read quite a number of articles about formatting dataframe columns; unfortunately, most of them are about date and time conversions.
I came across the article "Format integer column of Dataframe in Python pandas" at http://www.datasciencemadesimple.com/format-integer-column-of-dataframe-in-python-pandas/, which does not solve my problem.
Based on the above-mentioned article, I have tried the following:
pd.to_numeric(data02['IDDLECT'], downcast='integer')
Out[63]:
0 2010392.0
1 111777967.0
2 2010392.0
3 2012554.0
4 2010392.0
5 2010392.0
6 2010392.0
7 1170126.0
and as you can see, the column values still have a decimal point and a trailing zero.
I expect loading the dataframe from a csv file to keep a number such as 2010392 as 2010392, not 2010392.0.
Here is the code that I have tried:
import pandas as pd
data = pd.read_csv("timetable_all_2019-2_groups.csv")
data02 = data.drop_duplicates()
print(f'Len data {len(data)}')
print(data.head(20))
print(f'Len data02 {len(data02)}')
print(data02.head(20))
pd.to_numeric(data02['IDDLECT'], downcast='integer')
Here are a few lines of the content of the csv file (the data in the one source):
IDDCYR,IDDSUBJ,IDDOT,IDDGRPTYP,IDDCLASSGROUP,IDDLECT,IDDPRIMARY
2019,AAACA1B,VF,C,A1,2010392,Y
2019,AAACA1B,VF,C,A1,111777967,N
2019,AAACA3B,VF,C,A1,2010392,Y
2019,AAACA3B,VF,C,A1,2012554,N
2019,AAACB2A,VF,C,B1,2010392,Y
2019,AAACB2A,VF,P,B2,2010392,Y
2019,AAACB2A,VF,C,B1,2010392,N
2019,AAACB2A,VF,P,B2,1170126,N
2019,AAACH1A,VF,C,A1,2010392,Y
Looks like you have data which is not of integer type. Once loaded, you should do something about that data and then convert the column to int.
From your error description, you have NaNs and/or inf values. You could impute the missing values with the mode, mean, median, or a constant value. You can achieve that either with pandas or with scikit-learn's SimpleImputer, which is dedicated to imputing missing values.
Note that if you use mean, you may end up with a float number, so make sure to get the mean as an int.
The imputation method you choose really depends on what you'll use this data for later. If you want to understand the data, filling NaNs with 0 may distort aggregations later (e.g. if you want to know what the mean is, it won't be accurate).
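For instance, a minimal sketch of mean imputation that keeps the column integral (the column name is taken from the question):

# Truncate the (float) column mean to an int before imputing with it.
fill_value = int(data02['IDDLECT'].mean())
data02['IDDLECT'] = data02['IDDLECT'].fillna(fill_value).astype('int')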
That being said, I see you're dealing with categorical data. One option here is to use dtype='category'. If you later fit a model with this and leave the ids as numbers, the model can conclude weird things that are not correct (e.g. that the sum of two ids equals some third id, or that higher ids are more important than lower ones). Such relationships make no sense a priori and should not be left to chance.
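A sketch of the categorical route, declaring the dtype at load time (file and column names taken from the question):

import pandas as pd

# Parsing IDDLECT as a category keeps the ids as labels, so they never
# pass through float and no trailing .0 appears.
data = pd.read_csv("timetable_all_2019-2_groups.csv", dtype={'IDDLECT': 'category'})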
Hope this helps!
data02['IDDLECT'] = data02['IDDLECT'].fillna(0).astype('int')
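If filling missing values with 0 would distort the data, pandas' nullable integer dtype is an alternative (a sketch, assuming a reasonably recent pandas, 0.24 or later):

# 'Int64' (capital I) is pandas' nullable integer dtype: missing values stay
# as <NA> while the remaining values display without a trailing .0
data02['IDDLECT'] = data02['IDDLECT'].astype('Int64')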
I'm new to Python and after a lot of tinkering, have managed to clean up some .csv data.
I now have a bunch of countries as rows and a bunch of dates as columns, and am trying to create a chart showing a line for each country's value over time.
The problem is that when I enter df.plot() it results in a chart with each date as a line.
I have melted the data such that the first column is country, second is date, and third is value, but all I get is a single blue block growing over time (not multiple lines). How can I fix this?
You can use the transpose function in pandas: instead of df.plot(), plot the transposed frame with df.T.plot(), so that rows and columns swap and each country becomes its own line.
As was mentioned in the comments, it is always better to provide an example (see @ImportanceOfBeingErnest's comment).
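A minimal sketch of the transpose approach (the country names and values here are made up):

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical wide frame: countries as rows, dates as columns.
df = pd.DataFrame(
    {'2020-01': [1.0, 2.0], '2020-02': [1.5, 2.5], '2020-03': [2.0, 3.0]},
    index=['Country A', 'Country B'],
)

# After transposing, the dates are the index and each country is one line.
df.T.plot()
plt.show()

If you are starting from the melted form (country, date, value), pivot it back to wide first with melted.pivot(index='date', columns='country', values='value') and then call .plot() on the result.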
I have searched every possible solution, but it never seems to create the plots in a way that is legible for me. It should also work for potentially hundreds of dataframe columns, so a solution using a loop or something of that nature would be preferred.
My dataframe is roughly this
data=
Time Pressure Static Temperature Stag Temperature
0 100 50 75
10 105 55 77
20 110 59 81
30 106 57 79
What I would like is 3 different graphs that plot Pressure, Static Temp, and Stag Temp vs Time which would be the X-axis.
My current code looks like
import pandas
data=pandas.read_csv('data.csv')
for header in data:
    data.plot(x='Time',y=header)
I think I understand the problem, which is that data.plot needs to have y="something in quotes", but I thought that because header is a string it should work.
Any solution to get multiple graphs would be absolutely wonderful!
Also I apologize if my formatting is messed up as this is my first time posting!
I think you're looking for this:
>>> data.plot(x="Time")
However, to achieve this, I had to reformat your data.csv file to replace white spaces with commas, as the comma is the default separator in a Comma Separated Values file. Maybe your original file is tab-separated; in that case, you need to pass sep='\t' to the read_csv() call.
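Alternatively, if you want a separate graph per column rather than one shared chart, the subplots option does that in one call (a sketch, assuming the file is whitespace-separated):

import pandas as pd
import matplotlib.pyplot as plt

# sep=r'\s+' splits on any run of whitespace (an assumption about the file format).
data = pd.read_csv('data.csv', sep=r'\s+')

# subplots=True draws each remaining column on its own axes.
data.plot(x='Time', subplots=True)
plt.show()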
If anyone finds this in the future, I figured out my own problem!
The problem was an error was thrown every time and all it said was
KeyError: 'Time'
This issue arose because 'Time' was my x-axis, and then it became my y-axis through the iteration over data. Thus it would stop every single time on the first loop.
To fix this, all I had to do was add a statement that skips the column which is my x-axis:
import pandas
data=pandas.read_csv(r'data.csv')
for header in data:
    if header!="Time":
        data.plot(x='Time',y=header,legend=False)
This skipped the first column and allowed the rest of the headers to be plotted in separate graphs.
If iterating over the headers confuses you (like it confused me at first), you can use a more general form:
import pandas
data=pandas.read_csv(r'data.csv')
for i in list(data):
    if i!="Time":
        data.plot(x='Time',y=i,legend=False)
Good luck everyone!
I am working with a CSV file and I need to find the several largest items in a column. I was able to find the top value just by doing the standard looping through and comparing values.
My idea for getting the top few values would be to store all of the values from that column in an array, sort it, and then pull the last three indices. However, I'm not sure that would be a good idea in terms of efficiency. I also need to pull other attributes associated with the top values, and it seems like separating out these column values would make everything messy.
Another thing I thought about doing is keeping three variables and doing a running top-value comparison, where every time I find something bigger I compare the "top three" amongst each other and reorder them. That also seems a bit complex, and I'm not sure how I would implement it.
I would appreciate some ideas, or if someone could tell me whether I'm missing something obvious. Let me know if you need to see my sample code (I felt it was probably unnecessary here).
Edit: To clarify, if the column values are something like [2,5,6,3,1,7] I would want to have the values first = 7, second = 6, third = 5
Pandas looks perfect for your task:
import pandas as pd
df = pd.read_csv('data.csv')
df.nlargest(3, 'column name')
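Note that nlargest returns the full rows, so the other attributes associated with the top values come along for free. For the plain-list case from your edit, a sketch without pandas: heapq.nlargest finds the top k in O(n log k) without fully sorting.

import heapq

values = [2, 5, 6, 3, 1, 7]
first, second, third = heapq.nlargest(3, values)
print(first, second, third)  # 7 6 5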