Merge GoogleTrends Data Reports in Python
I'm quite new to Python and... well... let's say, not really an expert when it comes to coding. So apologies for the very amateurish question in advance. I'm trying to merge several Google Trends report.csv files to use for my research.
Two problems I encounter:
The report files aren't just a plain spreadsheet but contain lots of other information that is irrelevant to me. I.e. I just want a certain block of each file to be merged (really just the daily data containing the dates and the corresponding SVI for each month, say rows 6 to 30).
As the daily data will be extracted from monthly report files and months do not have a constant number of days, I cannot just read a fixed range of rows but would need the range to adapt to the number of days the specific month has.
Many thanks for the help!
Edit:
The code I use:
import pandas as pd

# skip the metadata at the top and (attempt to) skip the footer tables at the bottom
report = pd.read_csv('C:/Users/paul/Downloads/report.csv', skiprows=4, skipfooter=17)
print(report)
The output it produces
I managed to cut the first few lines off, but I don't know how to cut off the bottom bit from row 31 onwards, so skipfooter didn't seem to work. And I can't use nrows, because the months don't have the same number of days, so I won't know the number of rows in advance.
It turned out that it does help to occasionally read the warnings Python gives.
ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support skip_footer; you can avoid this warning by specifying engine='python'.
The problem I had, namely that the skip_footer option didn't seem to work, was apparently related to the C engine being used.
For anyone running into the same issue, here's the code I solved it with:
import pandas as pd

# engine='python' is needed because the C engine does not support skip_footer
report = pd.read_csv('C:/Users/paul/Downloads/report.csv', skiprows=4, skip_footer=27, engine='python')
print(report)
Just add engine='python' to get rid of the C engine problem. Don't ask me why I had to skip 27 rows in the end (I was pretty sure I counted 17), but with a bit of trial and error this just worked.
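For the actual merging of several monthly files, the trial-and-error footer count can be avoided by cutting each file at the first blank line below the daily table instead, which also handles months of different lengths. Here is a minimal sketch under stated assumptions: the report*.csv glob pattern is hypothetical, and the layout (four metadata lines, a header row, the daily block, then a blank line before the footer tables) is inferred from the description above, not verified against a real file.

import glob
import pandas as pd

frames = []
for path in glob.glob('C:/Users/paul/Downloads/report*.csv'):  # hypothetical file pattern
    with open(path, encoding='utf-8') as f:
        lines = f.read().splitlines()
    block = lines[4:]                                      # drop the 4 metadata lines at the top
    end = block.index('') if '' in block else len(block)   # daily table ends at the first blank line
    header = block[0].split(',')                           # the header row of the daily table
    rows = [line.split(',') for line in block[1:end]]      # however many days the month has
    frames.append(pd.DataFrame(rows, columns=header))

# stack all monthly tables into one continuous daily series
merged = pd.concat(frames, ignore_index=True)
merged.to_csv('merged_reports.csv', index=False)

Because the cut is driven by the blank line rather than a fixed row count, the same loop works for 28-, 30- and 31-day months.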
Related
Converting 0-1 values in a dataset to the name of the column if the value of the cell is 1
I have a csv dataset with the values 0-1 for the features of the elements. I want to iterate over each cell and replace the values 1 with the name of its column. There are more than 500 thousand rows and 200 columns and, because the table is exported from another annotation tool which I update often, I want to find a way in Python to do it automatically. This is not the table, but a sample test which I was using while trying to write the code. I tried some approaches, but without success. I would really appreciate it if you can share your knowledge with me. It will be a huge help. The final result I want to have is of the type: (abonojnë, token_pos_verb). If you know any method that I can do this in Excel without the help of Python, it would be even better. Thank you, Brikena

Text,Comment,Role,ParentID,doc_completeness,lemma,MultiWord_Expr,token,pos,punctuation,verb,noun,adjective
abonojnë,,,,,,,1,1,0,1,0,0
çokasin,,,,,,,1,1,0,1,0,1
gërgasin,,,,,,,1,1,0,1,0,0
godasin,,,,,,,1,1,0,1,0,0
përkasin,,,,,,,1,1,1,1,0,0
përdjegin,,,,,,,1,1,0,1,0,0
lakadredhin,,,,,,,1,1,0,1,1,0
përdredhin,,,,,,,1,1,0,1,0,0
spërdredhin,,,,,,,1,1,0,1,0,0
përmbledhin,,,,,,,1,1,0,1,0,0
shpërdredhin,,,,,,,1,1,0,1,0,0
arsejnë,,,,,,,1,1,0,1,1,0
çapëlejnë,,,,,,,1,1,0,1,0,0
Using pandas, this is quite easy:

# pip install pandas
import pandas as pd

# read data (here example with csv, but use "read_excel" for excel)
df = pd.read_csv('input.csv').set_index('Text')

# reshape and export
(df.mul(df.columns).where(df.eq(1))
 .stack().rename('xxx')
 .groupby(level=0).apply('_'.join)
).to_csv('output.csv')  # here use "to_excel" for excel format

output file:

Text,xxx
abonojnë,token_pos_verb
arsejnë,token_pos_verb_noun
godasin,token_pos_verb
gërgasin,token_pos_verb
lakadredhin,token_pos_verb_noun
përdjegin,token_pos_verb
përdredhin,token_pos_verb
përkasin,token_pos_punctuation_verb
përmbledhin,token_pos_verb
shpërdredhin,token_pos_verb
spërdredhin,token_pos_verb
çapëlejnë,token_pos_verb
çokasin,token_pos_verb_adjective
An update for those who may find it helpful in the future. Thank you to @mozway for helping me. A friend of mine suggested working with an Excel formula, because the solution with pandas and groupby eliminates duplicates. I need all the duplicates: since it's an annotated corpus, it's normal that there are repeated words which should appear in every context, not only the first occurrence. The alternative is this: use a second sheet in the Excel file, write the formula =IF(Sheet1!B2=1,Sheet2!B$1,"") in the first cell with 0-1 values and drag it across all the other cells. This keeps all the occurrences of the words. It's quick and it works like magic. I hope this can be helpful to others who want to convert a 0-1 dataset to feature names without having to code.
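For completeness, duplicates can also be kept in pandas by joining the feature names row by row instead of grouping by the word. A minimal sketch, assuming the six 0-1 feature columns from the sample above (the other columns in the sample are empty and are skipped here):

import pandas as pd

df = pd.read_csv('input.csv')

# the six 0-1 feature columns from the sample; the remaining columns are empty
feature_cols = ['token', 'pos', 'punctuation', 'verb', 'noun', 'adjective']

# for each row, keep the names of the columns whose value is 1 and join them;
# working row by row preserves duplicate words, unlike the groupby approach
flags = df[feature_cols].fillna(0).astype(bool)
df['xxx'] = flags.apply(lambda row: '_'.join(row.index[row]), axis=1)
df[['Text', 'xxx']].to_csv('output.csv', index=False)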
Pandas dataframe to numpy array [duplicate]
This question already has answers here: Convert pandas dataframe to NumPy array (15 answers) Closed 3 years ago.

I am very new to Python and have very little experience. I've managed to get some code working by copying and pasting and substituting the data I have, but I've been looking up how to select data from a dataframe and can't make enough sense of the examples to substitute my own data in.

The overarching goal: (if anyone could actually help me write the entire thing, that would be helpful, but highly unlikely and probably not allowed) I am trying to use scipy to fit the curve of a temperature change when two chemicals react. There are 40 trials. The model I am hoping to use is a generalized logistic function with six parameters. All I need are the 40 functions, and nothing else. I have no idea how to achieve this, but I will ask another question when I get there.

The current issue: I had imported 40 .csv files and compiled/shortened the data into 2 sections so that there are 20 trials in 1 file. Now the data has 21 columns and 63 rows. There is a title in the first row for each column, and the first column is a consistent time interval. However, not every trial runs that long; one of them does, though. So I've managed to write the following code for a dataframe:

import pandas as pd
df = pd.read_csv("~/Truncated raw data hcl.csv")
print(df)

It prints the table out, but as expected, there are NaNs where there exists no data. So I would like to know how to arrange it into a workable array with 2 columns, time and a trial, like an (x,y) for a graph for future work with numpy or scipy, such that the rows where there is no data would not be included. Part of the .csv file begins after the horizontal line. I'm too lazy to put it in a code block, sorry. Thank you.
time,1mnaoh trial 1,1mnaoh trial 2,1mnaoh trial 3,1mnaoh trial 4,2mnaoh trial 1,2mnaoh trial 2,2mnaoh trial 3,2mnaoh trial 4,3mnaoh trial 1,3mnaoh trial 2,3mnaoh trial 3,3mnaoh trial 4,4mnaoh trial 1,4mnaoh trial 2,4mnaoh trial 3,4mnaoh trial 4,5mnaoh trial 1,5mnaoh trial 2,5mnaoh trial 3,5mnaoh trial 4
0.0,23.2,23.1,23.1,23.8,23.1,23.1,23.3,22.0,22.8,23.4,23.3,24.0,23.0,23.8,23.8,24.0,23.3,24.3,24.1,24.1
0.5,23.2,23.1,23.1,23.8,23.1,23.1,23.3,22.1,22.8,23.4,23.3,24.0,23.0,23.8,23.8,24.0,23.4,24.3,24.1,24.1
1.0,23.2,23.1,23.1,23.7,23.1,23.1,23.3,22.3,22.8,23.4,23.3,24.0,23.0,23.8,23.8,24.0,23.5,24.3,24.1,24.1
1.5,23.2,23.1,23.1,23.7,23.1,23.1,23.3,22.4,22.8,23.4,23.3,24.0,23.0,23.8,23.8,23.9,23.6,24.3,24.1,24.1
2.0,23.3,23.2,23.2,24.2,23.6,23.2,24.3,22.5,23.0,23.7,24.4,24.1,23.1,23.9,24.4,24.2,23.7,24.5,24.7,25.1
2.5,24.0,23.5,23.5,25.4,25.3,23.3,26.4,22.7,23.5,25.8,27.9,25.1,23.1,23.9,27.4,26.8,23.8,27.2,26.7,28.1
3.0,25.4,24.4,24.1,26.5,27.8,23.3,28.5,22.8,24.6,28.6,31.2,27.2,23.2,23.9,30.9,30.5,23.9,31.4,29.8,31.3
3.5,26.9,25.5,25.1,27.4,29.9,23.4,30.1,22.9,26.4,31.4,34.0,30.0,23.3,24.2,33.8,34.0,23.9,35.1,33.2,34.4
4.0,27.8,26.5,26.2,27.9,31.4,23.4,31.3,23.1,28.8,34.0,36.1,32.6,23.3,26.6,36.0,36.7,24.0,37.7,35.9,36.8
4.5,28.5,27.3,27.0,28.2,32.6,23.5,32.3,23.1,31.2,36.0,37.5,34.8,23.4,30.0,37.7,38.7,24.0,39.7,38.0,38.7
5.0,28.9,27.9,27.7,28.5,33.4,23.5,33.1,23.2,33.2,37.6,38.6,36.5,23.4,33.2,39.0,40.2,24.0,40.9,39.6,40.2
5.5,29.2,28.2,28.3,28.9,34.0,23.5,33.7,23.3,35.0,38.7,39.4,37.9,23.5,35.6,39.9,41.2,24.0,41.9,40.7,41.0
6.0,29.4,28.5,28.6,29.1,34.4,24.9,34.2,23.3,36.4,39.6,40.0,38.9,23.5,37.3,40.6,42.0,24.1,42.5,41.6,41.2
6.5,29.5,28.8,28.9,29.3,34.7,27.0,34.6,23.3,37.6,40.4,40.4,39.7,23.5,38.7,41.1,42.5,24.1,43.1,42.3,41.7
7.0,29.6,29.0,29.1,29.5,34.9,28.8,34.8,23.5,38.6,40.9,40.8,40.2,23.5,39.7,41.4,42.9,24.1,43.4,42.8,42.3
7.5,29.7,29.2,29.2,29.6,35.1,30.5,35.0,24.9,39.3,41.4,41.1,40.6,23.6,40.5,41.7,43.2,24.0,43.7,43.1,42.9
8.0,29.8,29.3,29.3,29.7,35.2,31.8,35.2,26.9,40.0,41.6,41.3,40.9,23.6,41.1,42.0,43.4,24.2,43.8,43.3,43.3
8.5,29.8,29.4,29.4,29.8,35.3,32.8,35.4,28.9,40.5,41.8,41.4,41.2,23.6,41.6,42.2,43.5,27.0,43.9,43.5,43.6
9.0,29.9,29.5,29.5,29.9,35.4,33.6,35.5,30.5,40.8,41.8,41.6,41.4,23.6,41.9,42.4,43.7,30.8,44.0,43.6,43.8
9.5,29.9,29.6,29.5,30.0,35.5,34.2,35.6,31.7,41.0,41.8,41.7,41.5,23.6,42.2,42.5,43.7,33.9,44.0,43.7,44.0
10.0,30.0,29.7,29.6,30.0,35.5,34.6,35.7,32.7,41.1,41.9,41.8,41.7,23.6,42.4,42.6,43.8,36.2,44.0,43.7,44.1
10.5,30.0,29.7,29.6,30.1,35.6,35.0,35.7,33.3,41.2,41.9,41.8,41.8,23.6,42.6,42.6,43.8,37.9,44.0,43.8,44.2
11.0,30.0,29.7,29.6,30.1,35.7,35.2,35.8,33.8,41.3,41.9,41.9,41.8,24.0,42.9,42.7,43.8,39.3,,43.8,44.3
11.5,30.0,29.8,29.7,30.1,35.8,35.4,35.8,34.1,41.4,41.9,42.0,41.8,26.6,43.1,42.7,43.9,40.2,,43.8,44.3
12.0,30.0,29.8,29.7,30.1,35.8,35.5,35.9,34.3,41.4,42.0,42.0,41.9,30.3,43.3,42.7,43.9,40.9,,43.9,44.3
12.5,30.1,29.8,29.7,30.2,35.9,35.7,35.9,34.5,41.5,42.0,42.0,,33.4,43.4,42.7,44.0,41.4,,43.9,44.3
13.0,30.1,29.8,29.8,30.2,35.9,35.8,36.0,34.7,41.5,42.0,42.1,,35.8,43.5,42.7,44.0,41.8,,43.9,44.4
13.5,30.1,29.9,29.8,30.2,36.0,36.0,36.0,34.8,41.5,42.0,42.1,,37.7,43.5,42.8,44.1,42.0,,43.9,44.4
14.0,30.1,29.9,29.8,30.2,36.0,36.1,36.0,34.9,41.6,,42.2,,39.0,43.5,42.8,44.1,42.1,,,44.4
14.5,,29.9,29.8,,36.0,36.2,36.0,35.0,41.6,,42.2,,40.0,43.5,42.8,44.1,42.3,,,44.4
15.0,,29.9,,,36.0,36.3,,35.0,41.6,,42.2,,40.7,,42.8,44.1,42.4,,,
15.5,,,,,36.0,36.4,,35.1,41.6,,42.2,,41.3,,,,42.4,,,
To convert a whole DataFrame into a numpy array, use

df = df.values

(values is an attribute, not a method, so there are no parentheses). If I understood you correctly, you want separate arrays for every trial though. This can be done like this:

data = [df.iloc[:, [0, i]].values for i in range(1, 21)]

which will make a list of numpy arrays, each one containing the first column with the time values and one of the trial columns.
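To also drop the rows where a trial has no data, as the question asks, a per-trial dropna can be added before the conversion. A minimal sketch, assuming the 21-column layout (time plus 20 trials) described in the question:

import pandas as pd

df = pd.read_csv("~/Truncated raw data hcl.csv")

# one (time, temperature) array per trial, with the NaN rows of that
# particular trial removed before conversion to numpy
arrays = [df.iloc[:, [0, i]].dropna().to_numpy() for i in range(1, df.shape[1])]

Using df.shape[1] instead of a hard-coded 21 keeps the list comprehension correct if the number of trial columns changes.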
How to read_csv float value with correct decimal value in python using pandas [duplicate]
I have a csv file containing numerical values such as 1524.449677. There are always exactly 6 decimal places. When I import the csv file (and other columns) via pandas read_csv, the column automatically gets the datatype object. My issue is that the values are shown as 2470.6911370000003 which actually should be 2470.691137. Or the value 2484.30691 is shown as 2484.3069100000002. This seems to be a datatype issue in some way. I tried to explicitly provide the data type when importing via read_csv by giving the dtype argument as {'columnname': np.float64}. Still the issue did not go away. How can I get the values imported and shown exactly as they are in the source csv file?
Pandas uses a dedicated decimal-to-binary converter that compromises accuracy in preference to speed. Passing float_precision='round_trip' to read_csv fixes this. Check out this page for more detail on this. After processing your data, if you want to save it back in a csv file, you can pass float_format="%.nf" to the corresponding method. A full example:

import pandas as pd

df_in = pd.read_csv(source_file, float_precision='round_trip')
df_out = ...  # some processing of df_in
df_out.to_csv(target_file, float_format="%.3f")  # for 3 decimal places
I realise this is an old question, but maybe this will help someone else: I had a similar problem, but couldn't quite use the same solution. Unfortunately the float_precision option only exists when using the C engine and not with the python engine. So if you have to use the python engine for some other reason (for example because the C engine can't deal with regex literals as delimiters), this little "trick" worked for me: in the pd.read_csv arguments, define dtype='str' and then convert your dataframe to whatever dtype you want, e.g. df = df.astype('float64'). Bit of a hack, but it seems to work. If anyone has any suggestions on how to solve this in a better way, let me know.
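A minimal sketch of that workaround, assuming a regex separator that forces the python engine and a file where every column is numeric (the filename and separator here are made up for illustration):

import pandas as pd

# a regex separator rules out the C engine, so float_precision is unavailable;
# reading everything as strings keeps the exact text from the file
df = pd.read_csv('data.csv', sep=r';+', engine='python', dtype='str')

# then convert to the numeric dtype in one step
df = df.astype('float64')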
Pandas returns a different word count than Notepad++ and Excel. Which one is correct?
I have a .csv file with 3 columns and 500.000+ lines. I'm trying to get insight into this dataset by counting occurrences of certain tags. When I started, I used Notepad++'s count function for the tags I found and noted the results by hand. Now that I want to automate that process, I use pandas to do the same thing, but the results differ quite a bit. The results for all tags summed up are:

Notepad++: 91.500
Excel: 91.677
Python/pandas: 91.034

Quite a difference, and I have no clue how to explain this or how to validate which result I can trust and use. My python code looks like this and is fully functional:

# CSV.READ | Delimiter: ; | Datatype: string | Using only first 3 columns
df = pd.read_csv("xxx.csv", sep=';', dtype="str")
# fills NaN with "Empty" to allow indexing
df = df.fillna("Empty")
# counts and sorts occurrences of the object(3) category
occurences = df['object'].value_counts()
# filters rows whose "object" column contains "Category:" tags
tags_occurences = df[df['object'].str.contains("Category:")]
# counts and sorts the filtered tags
tags_occurences2 = tags_occurences['object'].value_counts()

Edit: I already iterated through the other columns, which turned up another 120 tags, but there is still a discrepancy. In Excel and Notepad++ I just open Ctrl+F, search for "Category:" and use their count functions. Has anyone had a similar experience or can explain what might cause this? Do Excel & Notepad++ make errors while counting? I can't imagine pandas (being used in ML and data science a lot) would have such flaws.
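One common source of such discrepancies is that a cell can contain "Category:" more than once: value_counts and boolean filtering count matching cells, while an editor's Ctrl+F count function counts every occurrence of the search string. A minimal sketch of counting occurrences rather than cells, across all three columns at once (the filename and separator follow the question; this is an illustration of the idea, not a verified diagnosis of the 500-count gap):

import pandas as pd

df = pd.read_csv("xxx.csv", sep=';', dtype="str")

# str.count counts every match inside a cell, not just whether the cell matches;
# NaN cells yield NaN and are ignored by sum()
total = df.apply(lambda col: col.str.count("Category:").sum()).sum()
print(int(total))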
How to run an ADFuller test on timeseries data using statsmodels library?
I am completely new to programming languages and I have picked up Python for backtesting a trading strategy (because I heard it is relatively easy). I have made some progress in learning the basics, however I am currently stuck at performing an ADF test on a timeseries dataframe. This is how my dataframe looks. Now I need to run the ADF test on the columns "A-Btd", "A-Ctd" and so on (I have 66 columns like these). I would like to get the test statistic/output for each of them. I tried using lines such as cadfs = [ts.adfuller(df1)]. Since I lack the expertise, I am not able to adjust the code to my dataframe. I apologize in advance if I have missed out some important information. Please leave a comment and I will provide it asap. Thanks a lot in advance!
If you have to do it for so many columns, I would try putting the results in a dict, like this:

import statsmodels.tsa.stattools as tsa

df = ...  # load your dataframe
adf_results = {}
for col in df.columns.values:  # or edit this for a subset of columns first
    adf_results[col] = tsa.adfuller(df[col])

Obviously specify other settings as desired, e.g. tsa.adfuller(df[col], autolag='BIC'). Or if you don't want all the output and would rather just parse each column to find out if it's stationary or not, the test statistic is the first entry in the tuple returned by adfuller(), so you could just use tsa.adfuller(df[col])[0] and test it against your threshold to get a boolean result, then make that the value in your dict.
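As a follow-up, here is a small example of reading the results back out of that dict. adfuller returns a tuple whose first two entries are the test statistic and the p-value, so a quick per-column stationarity check could look like this (the 5% threshold is just a conventional choice, not something the question specifies):

for col, result in adf_results.items():
    stat, pvalue = result[0], result[1]
    print(f"{col}: statistic={stat:.3f}, p-value={pvalue:.4f}, "
          f"stationary at 5%: {pvalue < 0.05}")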