HDFStore start stop not working - python
I'm experimenting with the start and stop options of pandas HDFStore.select and they make no difference to the result. Is it clear what I am doing wrong?
The commands I'm using are:
    import pandas as pd

    hdf = pd.HDFStore(path % 'results')
    len(hdf.select('results', start=15, stop=20))
hoping to get a length of 4 or 5 (however the endpoints are counted), but it gives me the whole DataFrame.
When writing to the h5 file, use pandas.to_hdf(<path>, <key>, format='table'), which stores the frame in table format and enables selecting subsets of the store. With the default fixed format, start and stop are silently ignored, and that silence is itself a bug: you should get an error instead.
According to Jeff (https://stackoverflow.com/users/644898/jeff), this is a known bug, tracked here: github.com/pydata/pandas/issues/8287. Pull requests welcome.
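For reference, a minimal sketch of the difference between the two formats (the file and key names here are made up; this assumes PyTables is installed):

    import pandas as pd

    df = pd.DataFrame({'a': range(100)})

    # default 'fixed' format: start/stop used to be silently ignored here (the bug above)
    df.to_hdf('results.h5', key='results_fixed', format='fixed')

    # 'table' format: start/stop work as row offsets
    df.to_hdf('results.h5', key='results_table', format='table')

    with pd.HDFStore('results.h5') as hdf:
        print(len(hdf.select('results_table', start=15, stop=20)))  # 5 rows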
Related
Pandas convert date
Hey guys, I need some help: my dates are not in the correct format. I made a function to convert all the date columns; it works, but it raises a SettingWithCopyWarning (screenshots: https://i.stack.imgur.com/5xlT4.png, https://i.stack.imgur.com/hZe9f.png, https://i.stack.imgur.com/iglZB.png). Can you tell me how to solve this? I've tried several ways.
If your code is working and doing its job, you can always silence the warning by adding this at the top (I would not recommend it in a very large-scale project):

    import warnings
    from pandas.core.common import SettingWithCopyWarning

    warnings.simplefilter(action="ignore", category=SettingWithCopyWarning)
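Alternatively, the warning usually means the function is writing into a slice of another DataFrame; a minimal sketch of fixing it at the source rather than suppressing it (the column name and date format here are made up), by taking an explicit copy first:

    import pandas as pd

    df = pd.DataFrame({'date': ['01-02-2021', '15-03-2021']})
    converted = df[['date']].copy()  # explicit copy, so writes are unambiguous
    converted['date'] = pd.to_datetime(converted['date'], format='%d-%m-%Y')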
Pandas dataframe to numpy array [duplicate]
This question already has answers here: Convert pandas dataframe to NumPy array (15 answers). Closed 3 years ago.

I am very new to Python and have very little experience. I've managed to get some code working by copying, pasting, and substituting in my own data, but I've been looking up how to select data from a dataframe and can't make sense of the examples enough to substitute my own data in.

The overarching goal (if anyone could actually help me write the entire thing, that would be helpful, but highly unlikely and probably not allowed): I am trying to use scipy to fit the curve of a temperature change when two chemicals react. There are 40 trials. The model I am hoping to use is a generalized logistic function with six parameters. All I need are the 40 fitted functions, nothing else. I have no idea how to achieve this, but I will ask another question when I get there.

The current issue: I had imported 40 .csv files and compiled/shortened the data into 2 sections, so that there are 20 trials in 1 file. The data now has 21 columns and 63 rows. The first row holds a title for each column, and the first column is a consistent time interval. However, not every trial runs the full length; only one does. So far I've managed to write the following code for a dataframe:

    import pandas as pd

    df = pd.read_csv("~/Truncated raw data hcl.csv")
    print(df)

It prints the table out, but as expected there are NaNs where no data exists. I would like to know how to arrange it into workable arrays with 2 columns, time and a trial, like an (x, y) pair for a graph and for future work with numpy or scipy, such that the rows with no data are not included. Part of the .csv file begins after the horizontal line. I'm too lazy to put it in a code block, sorry. Thank you.
time,1mnaoh trial 1,1mnaoh trial 2,1mnaoh trial 3,1mnaoh trial 4,2mnaoh trial 1,2mnaoh trial 2,2mnaoh trial 3,2mnaoh trial 4,3mnaoh trial 1,3mnaoh trial 2,3mnaoh trial 3,3mnaoh trial 4,4mnaoh trial 1,4mnaoh trial 2,4mnaoh trial 3,4mnaoh trial 4,5mnaoh trial 1,5mnaoh trial 2,5mnaoh trial 3,5mnaoh trial 4
0.0,23.2,23.1,23.1,23.8,23.1,23.1,23.3,22.0,22.8,23.4,23.3,24.0,23.0,23.8,23.8,24.0,23.3,24.3,24.1,24.1
0.5,23.2,23.1,23.1,23.8,23.1,23.1,23.3,22.1,22.8,23.4,23.3,24.0,23.0,23.8,23.8,24.0,23.4,24.3,24.1,24.1
1.0,23.2,23.1,23.1,23.7,23.1,23.1,23.3,22.3,22.8,23.4,23.3,24.0,23.0,23.8,23.8,24.0,23.5,24.3,24.1,24.1
1.5,23.2,23.1,23.1,23.7,23.1,23.1,23.3,22.4,22.8,23.4,23.3,24.0,23.0,23.8,23.8,23.9,23.6,24.3,24.1,24.1
2.0,23.3,23.2,23.2,24.2,23.6,23.2,24.3,22.5,23.0,23.7,24.4,24.1,23.1,23.9,24.4,24.2,23.7,24.5,24.7,25.1
2.5,24.0,23.5,23.5,25.4,25.3,23.3,26.4,22.7,23.5,25.8,27.9,25.1,23.1,23.9,27.4,26.8,23.8,27.2,26.7,28.1
3.0,25.4,24.4,24.1,26.5,27.8,23.3,28.5,22.8,24.6,28.6,31.2,27.2,23.2,23.9,30.9,30.5,23.9,31.4,29.8,31.3
3.5,26.9,25.5,25.1,27.4,29.9,23.4,30.1,22.9,26.4,31.4,34.0,30.0,23.3,24.2,33.8,34.0,23.9,35.1,33.2,34.4
4.0,27.8,26.5,26.2,27.9,31.4,23.4,31.3,23.1,28.8,34.0,36.1,32.6,23.3,26.6,36.0,36.7,24.0,37.7,35.9,36.8
4.5,28.5,27.3,27.0,28.2,32.6,23.5,32.3,23.1,31.2,36.0,37.5,34.8,23.4,30.0,37.7,38.7,24.0,39.7,38.0,38.7
5.0,28.9,27.9,27.7,28.5,33.4,23.5,33.1,23.2,33.2,37.6,38.6,36.5,23.4,33.2,39.0,40.2,24.0,40.9,39.6,40.2
5.5,29.2,28.2,28.3,28.9,34.0,23.5,33.7,23.3,35.0,38.7,39.4,37.9,23.5,35.6,39.9,41.2,24.0,41.9,40.7,41.0
6.0,29.4,28.5,28.6,29.1,34.4,24.9,34.2,23.3,36.4,39.6,40.0,38.9,23.5,37.3,40.6,42.0,24.1,42.5,41.6,41.2
6.5,29.5,28.8,28.9,29.3,34.7,27.0,34.6,23.3,37.6,40.4,40.4,39.7,23.5,38.7,41.1,42.5,24.1,43.1,42.3,41.7
7.0,29.6,29.0,29.1,29.5,34.9,28.8,34.8,23.5,38.6,40.9,40.8,40.2,23.5,39.7,41.4,42.9,24.1,43.4,42.8,42.3
7.5,29.7,29.2,29.2,29.6,35.1,30.5,35.0,24.9,39.3,41.4,41.1,40.6,23.6,40.5,41.7,43.2,24.0,43.7,43.1,42.9
8.0,29.8,29.3,29.3,29.7,35.2,31.8,35.2,26.9,40.0,41.6,41.3,40.9,23.6,41.1,42.0,43.4,24.2,43.8,43.3,43.3
8.5,29.8,29.4,29.4,29.8,35.3,32.8,35.4,28.9,40.5,41.8,41.4,41.2,23.6,41.6,42.2,43.5,27.0,43.9,43.5,43.6
9.0,29.9,29.5,29.5,29.9,35.4,33.6,35.5,30.5,40.8,41.8,41.6,41.4,23.6,41.9,42.4,43.7,30.8,44.0,43.6,43.8
9.5,29.9,29.6,29.5,30.0,35.5,34.2,35.6,31.7,41.0,41.8,41.7,41.5,23.6,42.2,42.5,43.7,33.9,44.0,43.7,44.0
10.0,30.0,29.7,29.6,30.0,35.5,34.6,35.7,32.7,41.1,41.9,41.8,41.7,23.6,42.4,42.6,43.8,36.2,44.0,43.7,44.1
10.5,30.0,29.7,29.6,30.1,35.6,35.0,35.7,33.3,41.2,41.9,41.8,41.8,23.6,42.6,42.6,43.8,37.9,44.0,43.8,44.2
11.0,30.0,29.7,29.6,30.1,35.7,35.2,35.8,33.8,41.3,41.9,41.9,41.8,24.0,42.9,42.7,43.8,39.3,,43.8,44.3
11.5,30.0,29.8,29.7,30.1,35.8,35.4,35.8,34.1,41.4,41.9,42.0,41.8,26.6,43.1,42.7,43.9,40.2,,43.8,44.3
12.0,30.0,29.8,29.7,30.1,35.8,35.5,35.9,34.3,41.4,42.0,42.0,41.9,30.3,43.3,42.7,43.9,40.9,,43.9,44.3
12.5,30.1,29.8,29.7,30.2,35.9,35.7,35.9,34.5,41.5,42.0,42.0,,33.4,43.4,42.7,44.0,41.4,,43.9,44.3
13.0,30.1,29.8,29.8,30.2,35.9,35.8,36.0,34.7,41.5,42.0,42.1,,35.8,43.5,42.7,44.0,41.8,,43.9,44.4
13.5,30.1,29.9,29.8,30.2,36.0,36.0,36.0,34.8,41.5,42.0,42.1,,37.7,43.5,42.8,44.1,42.0,,43.9,44.4
14.0,30.1,29.9,29.8,30.2,36.0,36.1,36.0,34.9,41.6,,42.2,,39.0,43.5,42.8,44.1,42.1,,,44.4
14.5,,29.9,29.8,,36.0,36.2,36.0,35.0,41.6,,42.2,,40.0,43.5,42.8,44.1,42.3,,,44.4
15.0,,29.9,,,36.0,36.3,,35.0,41.6,,42.2,,40.7,,42.8,44.1,42.4,,,
15.5,,,,,36.0,36.4,,35.1,41.6,,42.2,,41.3,,,,42.4,,,
To convert a whole DataFrame into a numpy array, use df = df.values (note: values is an attribute, not a method). If I understood you correctly, though, you want separate arrays for every trial. That can be done like this:

    data = [df.iloc[:, [0, i]].values for i in range(1, 21)]

which will make a list of 20 numpy arrays, each containing the first column with the time values and one of the trial columns.
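The question also asks to leave out the rows with no data; a hedged extension of the same idea, calling dropna() per trial so each array only keeps rows where that trial has a value (the file path is the one from the question):

    import pandas as pd

    df = pd.read_csv("~/Truncated raw data hcl.csv")
    # one (time, trial) array per trial column, NaN rows dropped per trial
    data = [df.iloc[:, [0, i]].dropna().values for i in range(1, df.shape[1])]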
How to read_csv float value with correct decimal value in python [duplicate]
I have a csv file containing numerical values such as 1524.449677. There are always exactly 6 decimal places. When I import the csv file (along with other columns) via pandas read_csv, the column automatically gets the datatype object. My issue is that the values are shown as 2470.6911370000003, when they actually should be 2470.691137. Or the value 2484.30691 is shown as 2484.3069100000002. This seems to be a datatype issue of some kind. I tried to explicitly provide the data type when importing via read_csv by giving the dtype argument as {'columnname': np.float64}. Still the issue did not go away. How can I get the values imported and shown exactly as they are in the source csv file?
Pandas uses a dedicated decimal-to-binary converter that sacrifices some accuracy for speed. Passing float_precision='round_trip' to read_csv fixes this; check out this page for more detail. After processing your data, if you want to save it back to a csv file, you can pass float_format="%.nf" to the corresponding method. A full example:

    import pandas as pd

    df_in = pd.read_csv(source_file, float_precision='round_trip')
    df_out = ...  # some processing of df_in
    df_out.to_csv(target_file, float_format="%.3f")  # for 3 decimal places
I realise this is an old question, but maybe this will help someone else: I had a similar problem but couldn't quite use the same solution. Unfortunately the float_precision option only exists with the C engine, not with the python engine. So if you have to use the python engine for some other reason (for example because the C engine can't deal with regex literals as delimiters), this little "trick" worked for me: in the pd.read_csv arguments, define dtype='str' and then convert your dataframe to whatever dtype you want, e.g. df = df.astype('float64'). A bit of a hack, but it seems to work. If anyone has suggestions on how to solve this in a better way, let me know.
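A short sketch of that workaround, with a regex separator standing in for the reason the python engine is needed (the file name and separator are hypothetical):

    import pandas as pd

    # a regex separator forces the python engine, which lacks float_precision
    df = pd.read_csv('data.txt', sep=r';+', engine='python', dtype='str')
    df = df.astype('float64')  # convert after parsing instead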
How to run an ADFuller test on timeseries data using the statsmodels library?
I am completely new to programming languages and I have picked up Python for backtesting a trading strategy (because I heard it is relatively easy). I have made some progress learning the basics, but I am currently stuck performing an ADFuller test on a timeseries dataframe. This is how my dataframe looks (screenshot omitted). I need to run the ADF test on the columns "A-Btd", "A-Ctd", and so on (I have 66 columns like these), and I would like to get the test statistic/output for each of them. I tried using lines such as cadfs = [ts.adfuller(df1)], but since I lack the expertise I am not able to adjust the code to my dataframe. I apologize in advance if I have missed some important information; please leave a comment and I will provide it asap. Thanks a lot in advance!
If you have to do it for so many columns, I would try putting the results in a dict, like this:

    import statsmodels.tsa.stattools as tsa

    df = ...  # load your dataframe
    adf_results = {}
    for col in df.columns.values:  # or edit this for a subset of columns first
        adf_results[col] = tsa.adfuller(df[col])

Obviously specify other settings as desired, e.g. tsa.adfuller(df[col], autolag='BIC'). Or if you don't want all the output and would rather just find out whether each column is stationary or not, the test statistic is the first entry in the tuple returned by adfuller(), so you could just use tsa.adfuller(df[col])[0] and test it against your threshold to get a boolean result, then make that the value in your dict.
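As a variation, a sketch that reduces each column to a stationary/not-stationary flag using the p-value (entry 1 of the returned tuple) against the conventional, but arbitrary, 0.05 cutoff; df is assumed to be loaded as in the snippet above:

    import statsmodels.tsa.stattools as tsa

    # p-value (index 1) below 0.05 -> reject the unit-root null: likely stationary
    stationary = {col: tsa.adfuller(df[col])[1] < 0.05 for col in df.columns.values}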
Merge GoogleTrends Data Reports in Python
I'm quite new to Python and... well... let's say, not really an expert when it comes to coding. So apologies for the very amateurish question in advance. I'm trying to merge several Google Trends report.csv files to use for my research. Two problems I encounter:

The report files aren't just a spreadsheet; they contain lots of other, irrelevant information. I.e. I just want a certain block of each file to be merged (really just the daily data containing the dates and the corresponding SVI for each month, say rows 6 to 30).

As the daily data will be extracted from monthly report files and months do not have a constant number of days, I cannot just read fixed row numbers; they would need to match the number of days the specific month has.

Many thanks for the help!

Edit: The code I use:

    import pandas as pd

    report = pd.read_csv('C:/Users/paul/Downloads/report.csv', skiprows=4, skipfooter=17)
    print(report)

The output it produces (screenshot omitted): I managed to cut the first few lines off, but I don't know how to cut off the bottom bit from row 31 onwards, so skipfooter didn't seem to work. And I can't use nrows, as the months don't have the same number of days, so I won't know the number of rows in advance.
It turned out that it does help to occasionally read the warnings Python gives:

ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support skip_footer; you can avoid this warning by specifying engine='python'.

The problem I had, that the skip_footer option didn't work, was apparently related to the C engine being used. For anyone running into the same issue, here's the code I solved it with:

    import pandas as pd

    report = pd.read_csv('C:/Users/paul/Downloads/report.csv', skiprows=4, skip_footer=27, engine='python')
    print(report)

Just add engine='python' to get rid of the C engine problem. Don't ask me why I had to skip 27 rows in the end (I was pretty sure I counted 17), but with a bit of trial and error this just worked.
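If counting footer rows by trial and error gets tedious, a hedged alternative is to find the end of the daily block programmatically; this sketch assumes the daily data is followed by a blank line, which may not match every report export:

    import pandas as pd
    from io import StringIO

    with open('C:/Users/paul/Downloads/report.csv') as f:
        lines = f.readlines()

    start = 4  # same header rows skipped as in the question
    # stop at the first blank line after the daily data (fall back to the file end)
    end = next((i for i, line in enumerate(lines[start:], start) if not line.strip()),
               len(lines))
    report = pd.read_csv(StringIO(''.join(lines[start:end])))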